Network and Parallel Computing: IFIP International Conference, NPC 2010, Zhengzhou, China, September 13-15, 2010, Proceedings (Lecture Notes in ... Computer Science and General Issues)

Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris ...

Author: Chen Ding | Zhiyuan Shao | Ran Zheng

11 downloads 1281 Views 10MB Size Report

This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!

Report copyright / DMCA form

DOWNLOAD PDF

Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany

6289

Chen Ding Zhiyuan Shao Ran Zheng (Eds.)

Network and Parallel Computing IFIP International Conference, NPC 2010 Zhengzhou, China, September 13-15, 2010 Proceedings

13

Volume Editors Chen Ding University of Rochester P.O. Box 270226, Rochester, NY, 14627, USA E-mail: [email protected] Zhiyuan Shao School of Computer Science and Technology Huazhong University of Science and Technology Wuhan, 430074, China E-mail: [email protected] Ran Zheng School of Computer Science and Technology Huazhong University of Science and Technology Wuhan, 430074, China E-mail: [email protected]

Library of Congress Control Number: 2010933532

CR Subject Classification (1998): D.1, C.2.4, D.2, F.2, D.4, H.3 LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues ISSN ISBN-10 ISBN-13

0302-9743 3-642-15671-1 Springer Berlin Heidelberg New York 978-3-642-15671-7 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © IFIP International Federation for Information Processing 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180

Preface

The IFIP International Conference on Network and Parallel Computing is an international conference, aimed at providing an exciting platform and forum for researchers and developers from academia and industry to present their latest research in the ﬁeld of parallel computing systems and applications. This year NPC received 89 submissions from authors in 11 countries. The papers were reviewed by a 51-member Program Committee, with 26 members from USA, 9 from mainland China, and the rest from Canada, Hong Kong, Taiwan, Korea, Japan, UK, and France. Each paper received three to six reviews. Based on a total of 287 reviews, the Program Co-chairs accepted papers into two categories: Select papers: 23 papers passed the most stringent selection. Each paper is up to 15 pages in the conference proceedings. The acceptance rate for select papers is 25.8%. Regular papers: 11 additional papers are of suﬀcient publishable quality. Each paper is up to 10 pages. For the main conference, we invited three distinguished speakers: – Xiaodong Zhang (Ohio State University, USA): Building a Domain-Knowledge Guided System Software Environment to Achieve High Performance of Multicore Processors – Huaimin Wang (National University of Defense Technology, China): InternetBased Virtual Computing Environment – Jesse Fang (Intel Labs, China) One workshop was held in conjunction with the NPC 2010 conference: the International Workshop on Network on Chip (IWNoC 2010). The workshop was chaired by Huaxi Gu and Jiang Xu. The proceedings of the workshop are included in this volume. We would like to thank all the authors of submitted papers for their work and their interest in the conference. We would like to express our sincere appreciation to all members of the Program Committee. Of the 296 reviews we assigned, 287 (97%) were completed. It was from these reviews that we identiﬁed a set of submissions that were clear, relevant, and described high-quality work in parallel systems and applications. Through these reviews hundreds of authors received objective and often detailed feedback from a diverse group of experts. In addition, we would like to thank the General Chairs Hai Jin and Jean-Luc Gaudiot and the Steering Committee members Kemal Ebcioglu and Guang Gao for their invaluable advice and guidance, as well as Alfred Hofmann and Ursula Barth of the LNCS editorial team for their prompt and patient response to our questions and requests. The conference proceedings would not have been possible without

VI

Preface

the support of these individuals and organizations. At closing, it is our hope that all of these eﬀorts have helped to improve and promote parallel-computing research in China, other Asian countries, the USA and beyond. September 2010

Chen Ding Zhiyuan Shao Ran Zheng

Organization

Executive Committee General Co-chairs

Steering Committee Chair

Jean-Luc Gaudiot (University of California-Irvine, USA) Hai Jin (Huazhong University of Science and Technology, China)

Kemal Ebcioglu (Global Supercomputing Corporation, USA)

Program Co-chairs

Chen Ding (University of Rochester, USA) Zhiyuan Shao (Huazhong University of Science and Technology, China)

Publication Chair

Ran Zheng (Huazhong University of Science and Technology, China)

Publicity Chair

Wenbin Jiang (Huazhong University of Science and Technology, China)

Workshop Chair

Chen Yu (Huazhong University of Science and Technology, China)

Finance Chair

Na Zhao (Huazhong University of Science and Technology, China)

Registration Chair

Yingshu Liu (Huazhong University of Science and Technology, China)

Web Chair

Xuejiao Xie (Huazhong University of Science and Technology, China)

Program Committee Ishfaq Ahmad Luc Boug´e Sun Chan Wenguang Chen Arun Chauhan Randy Chow Yeh-Ching Chung

University of Texas at Arlington, USA IRISA/ENS Cachan Brittany, France Intel Corp. Tsinghua University, China Indiana University, USA University of Florida, USA National Tsing Hua University, Taiwan

VIII

Organization

Xiaobing Feng Bjoern Franke Yaoqing Gao Hwansoo Han Wei-Chung Hsu Weijia Jia Song Jiang Guohua Jin Francis C.M. Lau Kuan-Ching Li Xiaoming Li Xiao-Feng Li Zhiyuan Li Shih-wei Liao Shinming Liu Paul Lu Yingwei Luo Jun Ni Dimitrios Nikolopoulos Sven-Bodo Scholz Xipeng Shen Evgenia Smirni Jaspal Subhlok Makoto Takizawa Chunqiang Tang Xinmin Tian Clark Verbrugge Guojun Wang Zhenlin Wang Chengyong Wu Youfeng Wu Nong Xiao Chao-Tung Yang Laurence T. Yang Qing Yi Yijun Yu Xin Yuan Chao Zhang Weizhe Zhang Xiangyu Zhang Chengliang Zhang Yuan Zhao Weiming Zheng Yutao Zhong Xiaotong Zhuang

Institute of Computing Technology, China University of Edinburgh, UK IBM Toronto, Canada Sungkyunkwan University, Korea National Chiao Tung University, Taiwan City University of Hong Kong, Hong Kong Wayne State University, USA Rice University, USA Hong Kong Polytechnic University, Hong Kong Providence University, Taiwan University of Delaware, USA Intel Corp., China Purdue University, USA Google Corp. HP Corp. University of Alberta, Canada Peking University, China University of Iowa, USA Foudation for Research and Technology Hellas (FORTH), Greece University of Hertfordshire, UK The College of William and Mary, USA The College of William and Mary, USA University of Houston, USA Seikei University, Japan IBM T.J. Watson Research Center, USA Intel, USA McGill University, Canada Florida Atlantic University, USA Michigan Technological University, USA Institute of Computing Technology, China Intel Corp. National University of Defense Technology, China Tunghai University, Taiwan St. Francis Xavier University, Canada University of Texas at San Antonio, USA Open University, UK Florida State University, USA Intel China Research Center, China Harbin Institute of Technology, China Purdue University, USA Microsoft Corp. IBM T.J. Watson Research Center, USA Tsinghua University, China George Mason University, USA IBM T.J. Watson Research Center, USA

Table of Contents

Keynote Speech Building a Domain-Knowledge Guided System Software Environment to Achieve High-Performance of Multi-core Processors . . . . . . . . . . . . . . . . Xiaodong Zhang Internet-Based Virtual Computing Environment . . . . . . . . . . . . . . . . . . . . . Huaimin Wang

1 2

Session 1: Parallelization and Optimization Vectorization for Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jiutao Nie, Buqi Cheng, Shisheng Li, Ligang Wang, and Xiao-Feng Li

3

Just-in-Time Compiler Assisted Object Reclamation and Space Reuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yu Zhang, Lina Yuan, Tingpeng Wu, Wen Peng, and Quanlong Li

18

Optimization of Triangular Matrix Functions in BLAS Library on Loongson2F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yun Xu, Mingzhi Shao, and Da Teng

35

Exposing Tunable Parameters in Multi-threaded Numerical Code . . . . . . Apan Qasem, Jichi Guo, Faizur Rahman, and Qing Yi LU Decomposition on Cell Broadband Engine: An Empirical Study to Exploit Heterogeneous Chip Multiprocessors . . . . . . . . . . . . . . . . . . . . . . . . . Feng Mao and Xipeng Shen FDTM: Block Level Data Migration Policy in Tiered Storage System . . . Xiaonan Zhao, Zhanhuai Li, and Leijie Zeng

46

61 76

Session 2: Parallel Algorithms Scale-Adaptable Recrawl Strategies for DHT-Based Distributed Web Crawling System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiao Xu, Weizhe Zhang, Hongli Zhang, and Binxing Fang

91

Power Eﬃcient Scheduling for Hard Real-Time Systems on a Multiprocessor Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peter J. Nistler and Jean-Luc Gaudiot

106

X

Table of Contents

Storage Device Performance Prediction with Selective Bagging Classiﬁcation and Regression Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lei Zhang, Guiquan Liu, Xuechen Zhang, Song Jiang, and Enhong Chen

121

Embedding Algorithms for Bubble-Sort, Macro-Star, and Transposition Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . HyeongOk Lee, Hyun Sim, JungHyun Seo, and Mihye Kim

134

An Eﬃcient Simulation Algorithm for Cache of Random Replacement Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shuchang Zhou

144

DABGPM: A Double Auction Bayesian Game-Based Pricing Model in Cloud Market . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shifeng Shang, Jinlei Jiang, Yongwei Wu, Zhenchun Huang, Guangwen Yang, and Weimin Zheng

155

Session 3: Network NPA-BT: A Network Performance Aware BitTorrent Traﬃc Optimization Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Changyou Xing, Li Yang, and Ming Chen

165

User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jiqiang Song, Eugene Y. Tang, and Leibo Liu

177

ServiceStore: A Peer-to-Peer Framework for QoS-Aware Service Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jun Jin, Yu Zhang, Yuanda Cao, Xing Pu, and Jiaxin Li

190

Identifying Nearest Neighbor Nodes and Connectivity in Three-Dimensional Wireless Sensor Networks Using Poisson Point Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yanhuai Qu, Jianan Fang, and Shuai Zhang A Novel Trust Evaluation Model for Mobile P2P Networks . . . . . . . . . . . . Xu Wu

200

210

Session 4: Parallelization and Optimization (Cluster) Evaluating and Optimizing I/O Virtualization in Kernel-Based Virtual Machine (KVM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Binbin Zhang, Xiaolin Wang, Rongfeng Lai, Liang Yang, Zhenlin Wang, Yingwei Luo, and Xiaoming Li

220

Table of Contents

Distributed Stream Processing with DUP . . . . . . . . . . . . . . . . . . . . . . . . . . . Kai Christian Bader, Tilo Eißler, Nathan Evans, Chris GauthierDickey, Christian Grothoﬀ, Krista Grothoﬀ, Jeﬀ Keene, Harald Meier, Craig Ritzdorf, and Matthew J. Rutherford

XI

232

CCIndex: a Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries . . . . . . . . . . . . . . . . . . . . . . . . . Yongqiang Zou, Jia Liu, Shicai Wang, Li Zha, and Zhiwei Xu

247

Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Zhou, Jianfeng Zhan, Dan Meng, and Zhihong Zhang

262

Diﬀerentiated Replication Strategy in Data Centers . . . . . . . . . . . . . . . . . . Tung Nguyen, Anthony Cutway, and Weisong Shi Eﬃcient Pipelining Parallel Methods for Image Compositing in Sort-Last Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Fang, Guangzhong Sun, Peng Zheng, Tiening He, and Guoliang Chen

277

289

Session 5: GPU and Multicore memCUDA: Map Device Memory to Host Memory on GPGPU Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hai Jin, Bo Li, Ran Zheng, Qin Zhang, and Wenbing Ao Adaptive Line Size Cache for Irregular References on Cell Multicore Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qian Cao, Chongchong Zhao, Junxiu Chen, Yunxing Zhang, and Yi Chen

299

314

Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei Mi, Xiaobing Feng, Jingling Xue, and Yaocang Jia

329

Energy-Eﬃcient Scheduling of Real-Time Periodic Tasks in Multicore Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaodong Wu, Yuan Lin, Jian-Jun Han, and Jean-Luc Gaudiot

344

The Core Degree Based Tag Reduction on Chip Multiprocessor to Balance Energy Saving and Performance Overhead . . . . . . . . . . . . . . . . . . . Long Zheng, Mianxiong Dong, Hai Jin, Minyi Guo, Song Guo, and Xuping Tu

358

XII

Table of Contents

Session 6: Cloud and Grid Infrastructure Improve Throughput of Storage Cluster Interconnected with a TCP/IP Network Using Intelligent Server Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . Xuechen Zhang, Guiquan Liu, and Song Jiang Evaluate the Performance and Scalability of Image Deployment in Virtual Data Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kejiang Ye, Xiaohong Jiang, Qinming He, Xing Li, and Jianhai Chen A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yue Zhang and Yunxia Pei JAMILA: A Usable Batch Job Management System to Coordinate Heterogeneous Clusters and Diverse Applications over Grid or Cloud Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Juan Peng, Xiaoyi Lu, Boqun Cheng, and Li Zha User-Centric Privacy Preservation in Data-Sharing Applications . . . . . . . Feng Gao, Jingsha He, and Shufen Peng Software Metrics Reduction for Fault-Proneness Prediction of Software Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yunfeng Luo, Kerong Ben, and Lei Mi

373

390

402

412 423

432

Session 7: Network on Chip A Methodology for Design of Unbuﬀered Router Microarchitecture for S-Mesh NoC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hao Liu, Feifei Cao, Dongsheng Liu, Xuecheng Zou, and Zhigang Zhang

442

A Worst Case Performance Model for TDM Virtual Circuit in NoCs . . . . Zhipeng Chen and Axel Jantsch

452

Convex-Based DOR Routing for Virtualization of NoC . . . . . . . . . . . . . . . Guang Sun, Yuanyuan Zhang, Yong Li, Li Su, Depeng Jin, and Lieguang Zeng

462

MPSoC Architecture-Aware Automatic NoC Topology Design . . . . . . . . . Rachid Dafali and Jean-Philippe Diguet

470

ERA: An Eﬃcient Routing Algorithm for Power, Throughput and Latency in Network-on-Chips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Varsha Sharma, Rekha Agarwal, Manoj S. Gaur, Vijay Laxmi, and Vineetha V. Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

481

491

Building a Domain-Knowledge Guided System Software Environment to Achieve High-Performance of Multi-core Processors Xiaodong Zhang The Ohio State University, USA

Although multi-core processors have become dominant computing units in basic system platforms from laptops to supercomputers, software development for effectively running various multi-threaded applications on multi-cores has not made much progress, and effective solutions are still limited to high performance applications relying on exiting parallel computing technology. In practice, majority multi-threaded applications are highly concurrent programs demanding high throughput, such as concurrent database transactions, massive and independent query requests in Web servers and search engines, and executing many-tasks for scientific applications in a multiprogramming mode. To best utilize the increasingly rich computing and cache resources in multi-core processors (many-cores in the near future) we must address several serious and difficult challenges. First, there are several critical hardware resources for multi-threads to share, such as the last level caches and the memory buses. However, the shared resource management is Largely controlled by hardware. Second, OS scheduler has little knowledge about applications' data demanding and access behavior, making sub-optimal task assignment decisions. Finally, the space allocation for each thread in the shared-cache is demand-based, often causing access conflicts and pollution, significantly degrading overall execution performance. We have developed a runtime environment connecting multi-threaded applications and system software in a collaborative way, where operating system is guided by application domain knowledge including data access locality and execution behavior to schedule tasks and allocate shared hardware resources for each running thread. We evaluate our environment by concurrent database transactions and multi-threaded scientific computing programs, and show strong performance and Throughput improvement by minimizing cache conflicts and misses in the last level caches of multi-cores. We further develop our system as a general framework to automatically manage multithreaded applications on multi-core processors.

C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, p. 1, 2010. © IFIP International Federation for Information Processing 2010

Internet-Based Virtual Computing Environment Huaimin Wang National Lab. For Parallel and Distributed Processing National University of Defense Technology Changsha, Hunan, P.R. China

Different from the traditional resources, resources over the Internet have such natural characteristics as growth, autonomy and diversity, which have brought grand challenges to their efficient sharing and comprehensive utilization. As there are the essential differences between the Internet computing environment and the traditional computer environment, the resource management methods for traditional computers is no longer suitable for the internet any more. The emergence and development of Web 2.0 and P2P technologies in recent years manifests that the resource management over Internet should adapt to the natural characteristics of the Internet resources and virtualization is one of the potential means to deal with those challenges. However, virtualization in Internet computing environment is essentially different from that in traditional computer environment. There are three scientific issues to be explored, i.e., on-demand aggregation in the open Internet environment, autonomic collaboration among distributed autonomous resources, computational properties of the aggregation and collaboration. We propose the Internet-based Virtual Computing Environment (iVCE), which aims at providing harmonious, trustworthy and transparent computing environment and services for end-users and applications by the means of novel computing models and innovative mechanisms, such as the on-demand resource aggregation and autonomic resource collaboration. The iVCE tries to virtualize and autonomicize Internet resources, design new models and mechanisms adapting well to the Internet computing environment, and promote the sharing and utilization of the Internet resources. Many progresses have been achieved in the model, technical framework and mechanisms of iVCE in the past several years. The concept model of iVCE includes autonomic element (AE), virtual commonwealth (VC) and virtual executor (VE). The architecture of iVCE is designed based on these models. Many mechanisms have been proposed in such areas as resource virtualization, on-demand aggregation, autonomic collaboration, trustworthy enhancement and programming language facility. The iVCE middleware has been designed and implemented, providing a set of services to support the flexible sharing and collaboration of Internet resources. The iVCE middleware has also provided some tools, such as virtual network storage, virtual network memory and virtual cluster for users. Many applications have been run on the iVCE middleware. iVCE will support the multi-scale sharing and collaboration of resources in various granularities, including the large-scale data center, client resources and various resources on the edge of Internet. The productivity and trustworthiness of iVCE will be focused on to investigate in the future. The integration of multiple virtual commonwealths is also part of our future work.

C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, p. 2, 2010. © IFIP International Federation for Information Processing 2010

Vectorization for Java Jiutao Nie, Buqi Cheng, Shisheng Li, Ligang Wang, and Xiao-Feng Li China Runtime Technologies Lab, Intel China Research Center {jiu-tao.nie,bu.qi.cheng,shisheng.li, ligang.wang,xiao.feng.li}@intel.com

Abstract. Java is one of the most popular programming languages in today’s software development, but the adoption of Java in some areas like high performance computing, gaming, and media processing is not as universal as in general-purpose computing. A major drawback preventing it from being extensively adopted in those areas is its lower performance than the traditional or domain-specific languages. This paper describes two approaches to improve Java’s usability in those areas by introducing vector processing capability to Java. The first approach is to provide a Java vectorization interface (JVI) that developers can program with, to explicitly expose the programs’ data parallelism. The other approach is to use automatic vectorization to generate vector instructions for Java programs. It does not require programmers to modify the original source code. We evaluate the two vectorization approaches with SPECjvm2008 benchmark. The performances of scimark.fft and scimark.lu are improved up to 55% and 107% respectively when running in single thread. We also investigate some factors that impact the vectorization effects, including the memory bus bandwidth and the superscalar micro-architecture. Keywords: Java, vectorization, dependence graph, memory bus bandwidth, superscalar micro-architecture.

1 Introduction Java as a programming language has modern features for software productivity, security and portability. Java also has a comprehensive standard library covering almost all kinds of application needs. It is one of the most widely used programming languages from mobile phone through server. Due to its position in modern computation environment, a great deal of work has been done to improve Java performance. With the introduction of just-in-time (JIT) compilation, the performance of Java programs has been improved significantly. More and more optimization techniques in static compilers have been adopted by the JIT compilers of Java. In many situations, the execution speed of Java programs can be comparable to the equivalent C programs. However, in terms of performance, there is still one major absence in Java world compared to C/C++ world, i.e., the support to leverage the powerful vector processing units of modern microprocessors. A C/C++ programmer can benefit from the vector instructions in various approaches, such as C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 3–17, 2010. © IFIP International Federation for Information Processing 2010

4

J. Nie et al.

inserting inline assembly, calling vector intrinsics, writing programs with vector API, etc. However, none of those approaches is available to Java programmers. To bring the benefit of vector computation capability into Java world, we develop two complementary approaches: a library-based programming approach and a compilation-based automatic approach. In this paper, we describe the two approaches and discuss the issues we meet with Java vectorization. The main contributions of this paper include: 1.

2. 3.

4.

We define a generic set of Java vectorization interface (JVI) with Java class library and implement JVI support in a Java virtual machine. JVI covers the vector operation spectrum of common IA32 and EM64T microprocessors. We implement the automatic vectorization in a Java virtual machine that tries to vectorize Java applications automatically. We evaluate the two vectorization approaches with SPECjvm2008 benchmark, and the performance of scimark.fft and scimark.lu is doubled (up to 55% and 107% respectively) when running in single thread. We investigate the factors that impact the vectorization effect. Our study shows that memory bus bandwidth can be a limiting factor for vectorization to scale up on multicore platforms. Superscalar micro-architecture can also hide the benefit of vectorization.

The rest of the paper is organized as follows. Section 2 discusses related work in program vectorization. Section 3 and Section 4 are the main body of this paper describing our Java vectorization work based on JVI and on automatic vectorization respectively. Section 5 gives the experimental results and discusses the issues. We conclude our work in Section 6.

2 Related Work The latest published vectorization work for Java is an SLP [2] automatic vectorizer implemented in Jikes RVM [3]. It uses a modified tree-pattern matching algorithm to identify similar successive instructions and turn them into equivalent vector instructions. It relies on loop-unrolling to vectorize computations of different iterations as other pure SLP algorithms do. The implementation is in the initial stage. It does not generate real single instruction multiple data (SIMD) instructions. Instead, it uses 32bit integer instructions to simulate the simple vector computations of 16-bit short and 8-bit byte types. In recent years, lots of work on automatic vectorization has been devoted into the GCC compiler [4, 5]. Java users can benefit from this work by compiling Java programs into native code with Gnu Compiler for Java (GCJ), which uses GCC's middle and back end to compile both Java source code and Java bytecode into native code. The vectorizer implemented in GCC supports simple loop-based automatic vectorization [6] and interleaved memory accesses in loop [7, 8]. A limitation of the current GCC’s vectorizer is that the memory accessing strides must be constants whose values are powers of 2. Our automatic vectorization algorithm does not have this limitation. It unrolls the part of the loop that cannot be vectorized. If the unrolled instructions access consecutive addresses, a complementing SLP vectorizer can

Vectorization for Java

5

further group them into vector instructions. A later important improvement on GCC's vectorizer is to integrate SLP vectorization into the previous loop-based vectorizer, which results in a loop-aware SLP vectorizer [9]. The improved one is more flexible to handle various memory accessing patterns. The Intel C++ compiler (ICC) provides four levels of vectorization supports: inline assembly, intrinsics, C++ vector library and automatic vectorization. The C++ vector library provides vector classes with overloaded operators. Operations on those types are translated into vector IR by the compiler's front end. Our Java vector class library does the same thing except that no overloaded operators are provided since Java does not support operator overloading. Novell implements a vector class library in its .Net framework, called Mono.Simd, to provide the vector API supports for C# programming. The APIs are mapped directly to the hardware vector operations. Due to the directly mapping, Mono.Simd might be tightly coupled with certain hardware versions.

3 Java Vectorization Interface (JVI) Both the JVI based vectorization and the automatic vectorization need the support of a just-in-time (JIT) compiler. We implement the two vectorization approaches in Jitrino, the optimizing JIT compiler of Apache Harmony. Jitrino has two levels of intermediate representation (IR), HIR and LIR. It supports most modern facilities for optimization, such as the static single assignment (SSA) form, the control flow graph and the loop tree. It also contains many commonly used compiler optimizations. The infrastructure of Jitrino with our vectorization work is shown in Figure 1. The flow on the right side illustrates the process from the Java source code to the native machine code. Boxes represent transformations, and ellipses represent programs in different forms. The JIT compiler, i.e. Jitrino is encircled by the dotted box, in which, the dashed boxes B, C and F are transformations modified for vectorization, and the other two dashed boxes D and E are modules added for vectorization and depended on by various transformations. The upper-left dashed box A is the JVI class library that can be used by programmers in their Java source code. In the figure, A and B are for JVI based vectorization, and C is for automatic vectorization. D, E and F provide support for both vectorization approaches. This section introduces the design and implementation of the JVI based vectorization. 3.1 JVI Design JVI is designed to be an abstract vector interface independent to concrete vector instruction sets. Programs written with it can be compiled to use different vector instructions, such as SSEx and AVX of Intel processors. JVI comprises a set of classes representing vector types of various primitive Java types, including 8, 16, 32 and 64 bits signed integers, as well as 32-bit and 64-bit floating point numbers. They are packaged into the name space com.intel.jvi. Currently, the default size of a vector is 128-bit.

6

J. Nie et al.

JVI class library A

Java source code Programmers write vectorized program with JVI

Java compiler Java bytecode

JIT compiler (Jitrino) with vectorization support

B

High level IR (HIR)

D Definition of vector type and operators for HIR

E Description of instructions for LIR

SIMD

Translate bytecode to HIR

HIR optimization passes C

Automatic vectorization

F

Translate HIR to LIR Low level IR(LIR) LIR optimization Native code encoder

Native code Uses or depends on Modules and transformations related to vectorization

Fig. 1. Infrastructure of vectorization implementation in Jitrino

Each vector class exposes a set of methods acting as vector operators. These “operators” cover most hardware supported vector operations, including initialization from scalar(s), arithmetic and logical computation, bit shifting, comparison, conditional selection, memory accessing and element shuffling. They provide enough flexibility for programmers to write efficient vectorized code. The following is a piece of the source code a JVI class representing the vector type of double. public class F64vec2 { public static F64vec2 make(double d0, double d1) { return fail (); } public static F64vec2 load(double[] a, int i) { return fail (); } public F64vec2 add(F64vec2 B) { return fail (); } public F64vec2 sub(F64vec2 B) { return fail (); } public F64vec2 mul(F64vec2 B) { return fail (); } …… …… }

Vectorization for Java

7

The exposed vector methods are only used to tell the front end of the JIT compiler (part B) how to translate them into the vector IR defined in part D. They should never be actually called, so their bodies only throw an exception (with fail()) to indicate that. These methods are designed to be pure functions, i.e. they never rely on or change the state of “this” object. This design forces programmers to treat vector types as primitive types, since internally, these vector types indeed behave as primitive types rather than classes. They can reside in both memory and registers, and can be passed as values in function calls. This is important to avoid dangerous misuse of the JVI library. At the same time, it gives the compiler more freedom to optimize. 3.2 JVI Implementation JVI interface is defined in class library. Programs using JVI must be translated into machine code by JVM. We extend the HIR of Jitrino by defining vector types and operators (see part D) to represent the JVI interface internally. The existing high level optimizations in Jitrino can be applied to the IR directly. Part B in Figure 1 is the front end of Jitrino. It translates Java bytecode into Jitrino HIR. JVI class references and method callings are correspondingly translated into vector types and vector instructions of the extended HIR in this stage. We will use the following code fragment from scimark.lu of SPECjvm2008 to illustrate the translation process. This hot loop consumes more than 90% execution time of the benchmark. for (int jj = j + 1; jj < N; jj++) Aii[jj] -= AiiJ * Aj[jj]; In the code, Aii and Aj are two arrays of double type values, and AiiJ is a double type variable. The manually vectorized version of the loop is as follows: 1 F64vec2 v_aiij = F64vec2.make (AiiJ); 2 for (int jj = j + 1; jj < N - 1; jj += 2) { 3 F64vec2 v_t1 = F64vec2.load (Aj, jj); 4 F64vec2 v_t2 = F64vec2.load (Aii, jj); 5 v_t2.sub (v_aiij.mul (v_t1)).store (Aii, jj); 6 } In the above code, F64vec2.make(AiiJ) creates a vector containing two double type values of AiiJ. F64vec2.load loads two consecutive double type values starting at the given index from the given array. v_t2.sub() and v_aiij.mul() return the results of vector subtraction and vector multiplication correspondingly between the “this” objects and their arguments. XXX.store(Aii, jj) stores two elements of XXX into Aii[jj] and Aii[jj+1]. The bytecode of this code fragment contains the vector type com.intel.jvi.F64vec and six calls of its methods. After translated by part B (in Figure 1), the six method calls become the following HIR instructions, where double<2> is the vector type of double. It is the vector IR defined for com.intel.jvi.F64vec2: I247: I262: I270: I271: I272: I274:

conv t186 -) t187 : double<2> ldind [t199] -) t200 : double<2> ldind [t207] -) t208 : double<2> mul t187, t200 -) t209 : double<2> sub t208, t209 -) t210 : double<2> stind t210 -) [t207]

8

J. Nie et al.

4 Automatic Vectorization Part C in Figure 1 is the automatic vectorization pass we implement in Jitrino. Since most opportunities for data parallelism occur between iterations of loops, traditional vectorization techniques mainly focus on exploiting loop level data parallelism. This kind of techniques is referred to as loop-based vectorization. Another kind of vectorization is called SLP (Superword Level Parallelism) vectorization [9]. It identifies groups of isomorphic instructions exposing superword level parallelism, and combines them into equivalent vector instructions. The loop-based vectorization exploits data parallelism among different executions of the same instruction, while the SLP vectorization exploits data parallelism among different instructions in the straight-line code (usually in the same basic block), so the SLP vectorization can be a complement to the loop-based vectorization. With loop-unrolling, loop level data parallelism can be transformed into superword level parallelism, so the SLP vectorization can also be used to exploit loop level data parallelism with the help of loop unrolling. However, duplicating instructions that can be parallelized multiple times and then re-recognizing them to be isomorphic from all duplicated instructions and finally combining them back into one vector instruction is not as efficient as the loop-based vectorization. The loop-based vectorization only transforms the loop once. To achieve the same effect as the loop-based vectorization, SLP vectorization also needs induction variable analysis and data dependence analysis (it may not do them, but that will cause missing vectorization opportunities). In fact, SLP vectorization is more complex and inefficient than loop-based vectorization for vectoring loops. As a result, we implement the loop-based vectorization in Jitrino and treat SLP vectorization as a complement that may be implemented in the future. Our loop-based vectorization is composed of two parts: vectorization analysis, and vectorization transformation. The first part analyzes and collects necessary information for all loops, and the second part performs the transformation. 4.1 Vectorization Analysis The vectorization analysis is applied to the leaf nodes in the loop tree. Only countable leaf loops with single entry and single exit are taken as vectorization candidates. The countability of a loop is determined by its exit condition. The exit condition of a countable loop must be a comparison between a loop invariant value and an induction variable. Whether a variable is loop invariant and how a variable changes is analyzed by the scalar evolution analysis. This analysis is called on demand for requested single variable, and is also called by data dependence analysis for analyzing array indices. The data dependence analysis builds data dependence graph (DDG) among all instructions of a loop. The dependences due to explicit variable references can be easily retrieved from the use-define chains incorporated in the SSA form IR. To determine dependences due to accessing aliased array elements, we first use the simplest GCD (greatest common division) test to filter out most non-aliased cases quickly. Then, according to the number of index variables in the array accessing expressions, we call the ZIV (zero index variable) test or SIV (single index variable) test [13]. For the case of multiple index variables (MIV), we simply assume there is a dependence since this case is relatively rare but the testing cost is quite high.

Vectorization for Java

9

Different from the algorithms described in [4, 7, 8], our vectorization algorithm tries to vectorize the loop even when there exists cyclic dependence in the loop, or some of the instructions can not be vectorized. This is a novel feature of our vectorization algorithm. As we know, other algorithms just give up the vectorization when there is a dependence circle. With a strongly connected component (SCC) analysis on the DDG, DDG nodes (instructions) can be grouped into different SCCs. Instructions of trivial SCCs (that only contain a single instruction) that can be supported by hardware vector instructions are considered as candidates for vectorization. All other instructions, including those of non-trivial SCCs and those that are not supported by hardware are considered as candidates for loop unrolling. The candidates for loop unrolling and those for vectorization are not split into two separate loops. Two separate loops require additional temporary array and memory accessesing instructions to pass data between them, which may greately degrade the performance. Our algorithm uses pack and extract operations to transfer data among unrolled scalar instructions and the vectorized vector instructions through registers, which is far more efficient than through memory. The pack operation packs a set of scalar values into a vector, which is implemented by a series of SSE instructions. The extract operation extracts a specific scalar element from a vector, which is implemented by a single SSE instruction. To increase vectorization opportunities, our algorithm tries to break dependence circles through dynamic alias testing. The algorithm finds DDG edges between two array accessing instructions in non-trivial SCCs, and generates alias testing code into the pre-header of the loop for all the pairs of arrays. Then, it removes all such DDG edges. As a result, some dependence circles may be broken. If there is no any alias, the control flow is directed into the vectorized code path. Otherwise, it is directed into the original loop. Each instruction selected for vectorization is assigned a vectorization factor (VF). The vectorization factor of an instruction is a number denoting how many scalar operations of that instruction can be parallelized in one corresponding vector operation. For example, suppose the size of vector types are all 128 bits, then the VF of a 32-bit integer instruction is four and the VF of a 64-bit floating point instruction is 2. The vectorization factor of a loop is a number denoting how many iterations of the original loop can be executed in one iteration of the vectorized loop. Instructions in a loop may have different vectorization factors. We choose the largest one as the VF for the loop to maximize the data parallelization. An approximate cost model for evaluating the profit of vectorization is given as follows: C1 ( I )

cost(I ) vf c

A( I ) U

cost ( I V ) (vf c / vf ( I ))

A(I ) V

0 A(def(o)) = A(I) ⎧ ⎪ C2 ( I ) = ∑o∈opnd ( I ) ⎨ cost(vec_pack) A(def(o)) = U ∧ A(I) = V ⎪cost(vec_extract) × vf ( I ) A(def(o)) = U ∧ A(I) = U ⎩ P = ∑ I∈L (cost ( I ) × vf c ) − ∑ I∈L (C1 ( I ) + C 2 ( I ))

(1)

(2)

(3)

10

J. Nie et al.

Notations: - C1(I): instruction cost of I after transformation; - C2(I): operand cost of I after transformation. - A(I): transformation action to instruction I. U: to unroll; V: to vectorize. - vfc: common VF of the loop; - vf(I): VF of instruction I. - IV : vector version of instruction I. - def(o): defining instruction of variable o. - opnd(I): operand set of instruction I. - cost(I): cost of instruction I. Equation (1) estimates the execution cost of instruction I after the transformation. Equation (2) estimates the operand retrieval cost of I after the transformation. The vec_pack is a pseudo operator representing a set of SSE instructions for packing scalar values into a vector operand. The vec_extract is another pseudo operator representing the SSE instruction for extracting a scalar operand from a vector value. Equation (3) calculates the performance profit of the action set given by A. For instructions that cannot be vectorized, we have no choice other than to unroll them. For other instructions that can be vectorized, we have two choices: to vectorize them or to unroll them. Different determinations of transforming actions for these instructions may lead to different profit. We use a simple local policy to try to maximize the profit. Our algorithm first finds all instructions that can be vectorized, marking them as "to-be-vectorized" and marking all others as"to-be-unrolled". Then, it goes through all instructions marked as "to-be-vectorized" and checks for each one of them whether changing the action for the instruction to "to-be-unrolled" will bring more profit (from eliminating operand retrieval cost). If yes, the algorithm changes the action of that instruction. This greedy policy may not generate the optimal result, but it is good enough in practice considering its simplicity and efficiency. 4.2 Vectorization Transformation After all SCCs of the DDG have been assigned actions, the transformation algorithm traverses all SCCs in the dependence order, in which depended SCCs appear before depending SCCs. For each of the SCCs, the algorithm emits either a vector instruction or a set of scalar instructions according to the action of that SCC. The key structure for connecting vectorized instructions and unrolled instructions is a map from variables of the original loop to arrays of variables of the transformed loop. For each original variable, the mapped-to array stores its copies in the transformed loop, corresponding to a sequence of iterations. Specifically, in our algorithm, the arrays have VF + 1 elements; the first VF elements of the array stores scalar copies of the original variable corresponding to VF iterations, and the last element of the array stores the vector copy of the original variable. Both scalar and vector copies are created on demand. The transformation algorithm is as follows. The function unroll emits unrolled scalar instructions to the loop body for instructions in a given SCC. The function vectorize emits a vector instruction to the loop body for the single instruction of the given SCC. The function map returns the variable corresponding to

Vectorization for Java

11

the requested iteration. At the first time a variable is requested, it creates the variable and generates extract or pack instructions to initialize that variable. transform () { for each SCC, say s, of the DDG in the dependence order switch (action of s) { case to_unroll: unroll (s); break; case to_vectorize: vectorize (s); break; } } unroll (s) { for (i = 0; i < VF; i++) for each instruction "'x' '=' 'y' 'op' 'z'" in s { create a new scalar variable, say 't'_i; emit "'t'_i '=' map('y',i) 'op' map('z',i)"; MAP('x')[i] = 't'_i; } } vectorize (s) { let "'x' = 'y' 'op' 'z'" be the single instruction of s; create a new vector variable, say 'v'; emit "'v' '=' map('y',VF) 'op' map('z',VF)"; MAP('v')[VF] = 'v'; } map (var, i) { if (MAP(var)[i] == NULL) { if (i < VF) { for (j = 0; j < VF; j++) { create a new scalar variable, say 't'_j; emit "'t'_j '=' 'vec_extract' MAP(var)[VF], j"; MAP(var)[j] = 't'_j; } } else { // i == VF create a new vector variable, say 'v'; emit "'v' '=' 'vec_pack' MAP(var)[0], MAP(var)[1], ..., MAP(var)[VF-1])"; MAP(var)[i] = 'v'; } } return MAP(var)[i]; } For example, in the following code, the arrays and the variable C are of floating point type. Reading from a[i-1] at line #5 and writing to a[i] at line #7 plus the addition at line #6 create a dependence circle with distance 1 and hence cannot be vectorized. Other instructions do not belong in any dependence circles and can be vectorized. The comments in the code indicate which SCC an instruction belongs in.

12

J. Nie et al.

1 for (int i = 1; i < N; i++) { 2 t1 = b[i]; // SCC 0 3 t2 = c[i]; // SCC 1 4 t3 = t1 * t2; // SCC 2 5 t4 = a[i - 1]; // SCC 3 6 t5 = t4 + t3; // SCC 3 7 a[i] = t5; // SCC 3 8 t6 = t5 * C; // SCC 4 9 d[i] = t6; // SCC 5 10 } The following is the transformed code commented with changes of the variable map, i.e. the MAP in the above algorithm. The SCCs are processed just in the order of SCC numbers given in the comments of the above code. Lines #2 through #4 are instructions generated for SCCs #0 through #2. Lines #5 through #20 are for SCC #3. Lines #21 through #23 are for SCC #4 and line #24 is for SCC #5. 1 for (int i = 1; i < N; i++) { 2 v1 = b[i:i+3]; // t1 -> [0, 0, 0, 0, v1] 3 v2 = c[i:i+3]; // t2 -> [0, 0, 0, 0, v2] 4 v3 = v1 * v2; // t3 -> [0, 0, 0, 0, v3] 5 s1 = a[i - 1]; // t4 -> [s1, 0, 0, 0, 0] 6 s2 = vec_extract(v3, 0); // t3 -> [s2, 0, 0, 0, v3] 7 s3 = s1 + s2; // t5 -> [s3, 0, 0, 0, 0] 8 a[i] = s3; 9 s4 = a[i]; // t4 -> [s1, s4, 0, 0, 0] 10 s5 = vec_extract(v3, 1); // t3 -> [s2, s5, 0, 0, v3] 11 s6 = s4 + s5; // t5 -> [s3, s6, 0, 0, 0] 12 a[i + 1] = s6; 13 s7 = a[i + 1]; // t4 -> [s1, s4, s7, 0, 0] 14 s8 = vec_extract(v3, 2); // t3 -> [s2, s5, s8, 0, v3] 15 s9 = s7 + s8; // t5 -> [s3, s6, s9, 0, 0] 16 a[i + 2] = s9; 17 s10 = a[i + 2]; // t4 -> [s1, s4, s7, s10, 0] 18 s11 = vec_extract(v3, 3); // t3->[s2, s5, s8, s11, v3] 19 s12 = s10 + s11; // t5 -> [s3, s6, s9, s12, 0] 20 a[i + 3] = s12; 21 v4 = vec_pack(s3, s6, s9, s12);//t5->[s3,s6,s9,s12,v4] 22 v5 = vec_duplicate(C); // C -> [0, 0, 0, 0, v5] 23 v6 = v4 * v5; // t6 -> [0, 0, 0, 0, v6] 24 d[i:i+3] = v6; 25 }

5 Experiments and Discussions We use SPECjvm2008 to evaluate the impact of our vectorization on performance on two representative multi-core platforms:

Vectorization for Java

1. 2.

13

Core i7 965 3.2GHz with 8MB L3, 4.80 GT/s QPI and 3GB DDR3 memory, representing desktop machines. Dual Xeon 5560 2.8 GHz with 8MB L3, 6.4 GT/s QPI and 12G DDR3 memory, representing server machines.

Processors of the machines are both based on the Nehalem micro-architecture with a powerful out-of-order engine and supporting 32-bit and 64-bit scalar instruction set and up to SSE4.2 instruction set. We used 32-bit Linux as the operation system (OS) on both platforms. The performance measurement includes the JIT compiling time, but the compiling time only occupies a very little portion of the whole running time of the benchmarks. In fact, the additional compiling time of the automatic vectorization comparing to JVI only comes from the automatic vectorization pass, which only applies to few loops satisfying many restrictions, and hence has very little impact on the benchmark scores. Therefore, JVI and automatic vectorization achieve similar performance and we do not distinguish them in the following discussions. 5.1 Performance Improvement by Vectorization During the execution of SPECjvm2008, there are totally 44 loops of 37 methods within benchmarks or libraries being vectorized. Though many of them only have a very low coverage and hence do not significantly contribute to the performance gain. These numbers at least indicate that there are considerably vectorization opportunities in general Java programs and libraries. In the performance evaluation, due to their high coverage of the hottest loop, scimark.lu (LU) and scimark.fft (FFT), two sub-benchmarks of SPECjvm2008, are selected as the main workloads for our vectorization evaluation. Both of these benchmarks have three kinds of inputs. They are small, default and large input sets, corresponding to 512KB, 8MB, 32MB input data set per thread. According to the data size requirement, we configure Harmony VM with maximal 600M heap to eliminate the unnecessary overhead from garbage collection. Figure 2 shows the program level performance improvements of LU and FFT. Multi-threading data are provided to check the vectorization effect in multi-core systems. From the figure we can see that vectorization can get obvious performance gain in almost all scenarios. With single thread, FFT and LU get 55% and 107% performance improvements on the i7 platform, and 45% and 100% on the Xeon platform. On the other hand, we can see that with the increment of the number of threads, the performance gain is reduced. Especially, for the default and large input data set, the performance gain degradation on the i7 platform is much faster than on the Xeon platform. Figure 3 shows the scalability of LU benchmark. From the data, we can find that, by vectorization, the scalability becomes worse on both platforms. Based on the fact that the size of the default and large data set of LU and FFT is larger than the whole L3 cache size, we deduce that the limit of memory bus bandwidth is the main reason that restrains the performance gain of vectorization. The higher memory bus bandwidth on the Xeon platform remedies the problem to some degree, and hence causes better scalability comparing with the i7 platform.

14

J. Nie et al. Small

Default

Small

Large

Improvement

Improvement

50.00% 40.00% 30.00% 20.00% 10.00%

Large

40.00% 30.00% 20.00% 10.00% 0.00%

0.00% 1

2

3

1

4

2

3

(a) FFT on Core i7 965 Small

Default

4

5

6

7

8

Threads number

Threads number

(b) FFT on Xeon 5560

Large

Small

120.00%

120.00%

100.00%

100.00%

Improvement

Improvement

Default

50.00%

60.00%

80.00% 60.00% 40.00% 20.00% 0.00%

Default

Large

80.00% 60.00% 40.00% 20.00% 0.00%

1

-20.00%

2

3

4

1

2

3

Threads number

4

5

6

7

8

Threads number

(c) LU on Core i7 965

(d) LU on Xeon 5560

Fig. 2. The performance improvement of FFT and LU by vectorization Orginal + Default Orginal + Large

Vectorized + Default Vecorized + Large

2.5

6

2

5

Speedup .

Speedup

Original + Default Original + Large

1.5 1 0.5

Vectoried + Default Vectoried + Large

4 3 2 1

0 1

2

3

4

0 1

2

3

Threads number

4

5

6

7

8

Threads number

(a) LU on Core i7 965

(b) LU on Xeon 5560

Fig. 3. The scalability of LU

To verify the bandwidth effect, we use the NUMA feature of the Xeon 5560 platform. On NUMA enabled platforms, accessing the local memory of the host processor is much faster than accessing the remote memory. Cores accessing local memory do not consume the bus bandwidth of remote memory. The NUMA feature can not be exploited with current Harmony VM because the VM allocates and binds whole memory heap to single processor in the main thread, although all benchmark threads run in single VM (Single VM mode). To illustrate the impact of NUMA feature on bandwidth with Harmony VM, we simulate the local memory access by running one benchmark thread in one VM instance, and binding the VM with one core. Multiple VM instances (Multiple VM mode) are executed at the same time to simulator multiple benchmark threads.

Vectorization for Java S-VM + Vec

M-VM + Vec

S-VM + Vec M-VM + Vec

Speedup

Improvement

M-VM + Seq 60.00% 50.00% 40.00% 30.00% 20.00% 10.00% 0.00% 4

5

6

7

15

S-VM + Seq M-VM + Seq

6 5 4 3 2 1 0

8

Threads number

(a) LU performance improvements

4

5

6

7

8

Threads number

(b) LU scalabilities

Fig. 4. LU performance improvements and scalabilities in different scenarios

In the experiment, we run both original sequential and vectorized codes in single VM and multiple VMs mode, totally four scenarios are executed. In both execution modes, all the benchmark threads are evenly mapped to different processors, and large data input set is used to make sure that the effect of cache misses is negligible. Figure 4 shows the data of benchmark LU from Xeon 5560 platform. The data of 4~8 threads are showed to illustrate the scalability problem. Figure 4 (a) shows the performance improvements of benchmark LU with the aids of NUMA and vectorization. In the figure, the sequential execution in single VM mode is used as the base line of the performance. From the data we can find that NUMA can dramatically improve the performance of vectorized programs. Figure 4 (b) compares the scalabilities of the LU benchmark running in four scenarios. From the data we can find that the scalability problem can be largely solved with NUMA architecture. 5.2 Limitation of the Partial Vectorization The partial vectorization algorithm described in the last section is powerful in the sense of exploiting data parallelism. However, whether it can improve the performance of a program depends on two factors: how much the inherentparallelism exists in the program, and how well the micro-architecture automatically exploits the parallelism without the vectorization. For example, most modern processors have superscalar micro-architecture that can exploit general instruction-level parallelism. If the inherent data parallelism of a program has been sufficiently exploited by the superscalar engine of the processor, performing vectorization for that program cannot bring any more performance gain. We observe this phenomenon with the benchmark scimark.sor, whose hot spot is a loop covering more than 90% execution time. The pseudo-code of the loop is given below: for (j = 1; j < N; j++) G[j]= A * (Gm[j] + Gp[j] + G[j-1] + G[j+1]) + B * G[j]; In this loop, all arrays and variables except j and N are of double type. With our partial vectorization algorithm, the first addition and the last multiplication at the second line are vectorized, and others are unrolled due to the dependence circle. One double addition, one double multiplication and three double loads are saved per two iterations. However, the performance of the vectorized program does not improve compared

16

J. Nie et al.

to the non-vetorized version. The reason is that the inherent parallelism of the loop has been sufficiently exploited by the powerful superscalar engine of Nehalem. Figure 5.(a) is the data dependence graph of the loop. The least execution time of this loop is determined by the execution time of the circle of the dependence graph in spite of how strong the parallelization ability of a processor is. Suppose double type addition, multiplication, load and store requires 3, 5, 4 and 1 processor cycles respectively, then the least execution cycles of the loop is (3 × 3 + 5 + 4 + 1) × N = 19N . During the 19 cycles of each iteration, the Nehalem micro-architecture with six issue ports, three pipelined execution units, one load unit and one store unit can easily arrange the other loads and computations to be executed concurrently. Thus, the execution speed of this loop has reached its toplimit.

19

+ * A + +

B

Gm[j] Gp[j]

(a)

* G[j]

A

+

G[j-1]

+ G[j]

*

* A

*

Gm[j]

A

G[j-1]

G[j]

+ B +

G[j+1]

Gm[j] Gp[j]

13

+

* + B

+

G[j+1]

G[j-1]

16

+

* +

G[j]

G[j]

G[j]

G[j+1]

Gp[j]

(b)

(c)

Fig. 5. Data dependence graph of the kernel loop of scimark.sor

In this example, expression re-association can help to expose more parallelism. When floating point precision is not required to be strict, the computation order can be changed to shorten the length of the dependence circle in the loop. The dependence graphs of the following two versions of the loop are shown in Figure 5 (b) and (c). for (j G[j] for (j G[j]

= = = =

1; j < N; j++) A * (Gm[j] + Gp[j] + G[j+1] + G[j-1]) + B * G[j]; 1; j < N; j++) A * (Gm[j] + Gp[j] + G[j+1]) + B * G[j] + A * G[j-1];

With the reorder of the computation, the numbers of the approximate cycles required by the dependence circles of version (b) and version (c) are 16 and 13 respectively, and their (non-vectorized) execution speeds are improved by about 30% from (a) to (b) and then 40% from (b) to (c). This means that, the superscalar is still powerful enough for exploiting the increased data parallelism even when the dependence circle is minimized, and there is no further optimization space left for the vectorization technique.

Vectorization for Java

17

6 Conclusion and Future Work In this paper, we introduce our Java vectorization work that uses two ways to exploit the data parallelism of Java applications. They can achieve similar vectorization results. Our work shows up to 55% and 107% performance gain for scimark.fft and scimark.lu of SPECjvm2008 when running in one thread. Performance gain was also observed with eight threads. In the manual vectorization, we define, design and implement a unified and generic set of Java Vectorization Interface so that Java programmers can use the interface for Java programming without considering the specific hardware supports. In the automatic vectorization, we propose a novel algorithm which provides aggressive vectorization supports to maximize the vectorization benefit. We analyze the result of our work, and our investigation shows that several factors such as memory bus bandwidth, superscalar micro-architecture and code shape need to be considered when applying vectorization techniques. Currently we are porting this work to JavaScript engine. It is challenging because JavaScript language is dynamically typed. Lots of branches inside code make regular data parallelism hard to be exploited. We are developing type analysis technique to overcome the problem.

References 1. The Apache Software Foundation. Apache Harmony, http://harmony.apache.org 2. Larsen, S., Amarasinghe, S.P.: Exploiting superword level parallelism with multi-media instruction sets. In: PLDI, pp. 145–156 (2000) 3. El-Mahdy, S.E.S.A., El-Mahdy, A.: Automatic vectorization using dynamic compilation and tree pattern matching technique in jikes rvm. In: ICOOOLPS, pp. 63–69 (2009) 4. Free Software Foundation. Auto-vectorization in gcc, http://gcc.gnu.org/projects/tree-ssa/vectorization.html 5. Free Software Foundation. Gcc, http://gcc.gnu.org 6. Naishlos, D.: Autovectorization in gcc. In: GCC Summit, pp. 105–118 (2004) 7. Nuzman, D., Rosen, I., Zaks, A.: Auto-vectorization of interleaved data for simd. In: Schwartzbach, M.I., Ball, T. (eds.) PLDI, pp. 132–143. ACM, New York (2006) 8. Nuzman, D., Zaks, A.: Autovectorization in gcc - two years later. In: GCC Summit, pp. 145–158 (2006) 9. Rosen, I., Nuzman, D., Zaks, A.: Loop-aware slp in gcc. In: GCC Summit, pp. 131–142 (2007) 10. Novell Corporation. Mono, Simd. Namespace, http://www.mono-project.com/ 11. Intel Corporation. IA-32 Intel Architecture Optimiztion Refernce Manual. Copyright (1999-2003) 12. Pedro, V., et al.: Automatic loop transformations and parallelization for Java. In: ICS 2000: 14th Int. Conf. on Supercomputing, pp. 1–10 (2000) 13. Allen, R., Kennedy, K.: Optimizing Compilers for Modern Architectures: A Dependencebased Approach. Morgan Kaufmann, San Francisco (2001)

Just-in-Time Compiler Assisted Object Reclamation and Space Reuse Yu Zhang1,2 , Lina Yuan1 , Tingpeng Wu1 , Wen Peng1,2 , and Quanlong Li1,2 1

School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, P.R. China 2 Software Security Laboratory, Suzhou Institute for Advanced Study, University of Science and Technology of China, Suzhou 215123, P.R. China [email protected]

Abstract. Garbage collection consumes signiﬁcant overhead to reclaim memory used by dead (i.e., unreachable) objects in applications. This paper explores techniques for compiler assisted object reclamation and allocation on an actual JVM. Thereinto, the just-in-time compiler identiﬁes dead objects using pointer and escape analysis combining liveness information and inserts calls to free them. The garbage collector provides runtime support for explicit reclamation and space reuse. Our approach diﬀers from other compiler assisted GC in two crucial ways. First, it identiﬁes not only the objects that are no longer referenced directly by the program, but also the objects that are referenced only by those identiﬁed to-be-freed objects. Second, it modiﬁes a parallel garbage collector, and not only frees the identiﬁed dead objects, but also tries to reuse their space immediately. The experimental results show that the JIT-assisted GC improves the memory utility and the performance eﬃciently. Keywords: Compiler assisted garbage collection, pointer and escape analysis, live variable information, parallel garbage collector.

1

Introduction

Garbage collection (GC) [1] is a technology that frees programmers from the error-prone task of explicit memory management. However, it consumes signiﬁcant overhead to ﬁnd dead (i.e., unreachable) objects in the managed heap and to reclaim the memory used by them. Accordingly, GC has become one of the dominant factors inﬂuencing performance of the runtime systems such as Java virtual machine (JVM). For example, SPECjbb2005 [2] usually spends 10% of its total execution time in GC. In order to reduce the cost of GC, other than improving GC algorithms [1, 3, 4], a more eﬀective approach is compiler assisted memory management, including stack allocation [5–7], region allocation [8–11], compile-time free [12–15] and reuse [15–17].

This research was supported in part by the National Natural Science Foundation of China under Grant No. 90718026 and gifts from Intel Corporation.

C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 18–34, 2010. c IFIP International Federation for Information Processing 2010

Just-in-Time Compiler Assisted Object Reclamation and Space Reuse

19

Stack or region allocation reduce the load of GC through allocating some objects in a method stack frame or a region, and all objects in a stack frame or in a region should be reclaimed simultaneously even if some of them became dead before. However, stack allocation may induce stack overﬂow, while region allocation needs sophisticated region management, neither has delivered improvements on garbage collectors. Compile-time free and reuse belong to compiler assisted GC, they improve the collection or reuse of objects allocated in heap through compiler eﬀorts. Some works insert free instructions to free dead objects [12–15], thus reduce GC the load of identifying dead objects. Others automate compile-time object merging or reuse [15–17] to decrease the number of objects allocated in heap. We explore techniques on compiler assisted object reclamation and space reuse on an actual JVM, i.e., Apache Harmony DRLVM [18], and implement them as a system called just-in-time compiler assisted garbage collection (JIT-assisted GC). The novel contributions we made are as follows: – We design a novel object lifetime analysis algorithm which is ﬁeld-sensitive and context-sensitive. The analysis combines pointer and escape analysis with ﬂow-sensitive liveness information to identify not only the objects that are no longer referenced directly by the program, but also the objects that are referenced only by those identiﬁed objects. – We collect the free instrument information from the dead object information based on the dominance relationship in control ﬂow. Various strategies are used to ensure the validity and ﬂexibility of the instrumentation. – We modify GCv5 [19], a parallel garbage collector, not only adding gc free interface for explicit reclamation but also improving gc alloc to try to reuse the explicitly reclaimed space immediately. The JIT-assisted GC system can handle multi-threaded programs. The experimental results show that the memory utility and the performance of the whole runtime system are improved eﬃciently.

2

Overview of the JIT-Assisted GC

In this section we ﬁrst give an overview of the framework of JIT-assisted GC, then take a simple example to illustrate the compiler analysis and transformation for explicit object deallocation. 2.1

The Framework of the JIT-Assisted GC

The JIT-assisted GC is built on DRLVM, involving several components of DRLVM, such as VMCore, EM(Execution Manager), Jitrino.OPT ( a JIT optimizing compiler), and GCv5 garbage collector, etc. VMCore concentrates most of the JVM control functions. EM selects a compiler or an interpreter for compiling/executing a method, handles proﬁles and the dynamic recompilation logic. Jitrino.OPT features two types of code intermediate representation (IR): platform-independent high-level IR (HIR) and

20

Y. Zhang et al.

platform-dependent low-level IR (LIR). Both of them are graph-based structures denoting the control ﬂow of a program. Jitrino incorporates an extensive set of code optimizations for each IR type, and deﬁnes the compilation process as a pipeline, which is a linear sequence of steps. Each step stores a reference to an action object (e.g., an optimization pass), its parameters and other information. GCv5 is a fully parallel garbage collector including various algorithms, and can work in generational and non-generational modes.

Fig. 1. Framework of the JIT-assisted GC

Fig.1 shows the framework of JIT-assisted GC, which mostly refers to the shadowed areas in the ﬁgure. On the JIT side, the compilation pipeline loads the bytecode of current to-be-compiled method, ﬁrst translates it into the HIR via the translator, then transforms the HIR into the LIR via the Code Selector, and last emits the native code via the Code Emitter. We currently explore techniques for explicit object reclamation and space reuse based on the framework and they work as follows: JIT side. To support explicitly object deallocation, an Op Free instruction and a VM RT FREE runtime call instruction are extended into HIR and LIR, respectively. And the Code Selector and the Code Emitter are modiﬁed to support translating the extended instructions. Then algorithms on compiler analysis and transformation for explicit object deallocation are designed and implemented as an optimization pass for HIR type. GC side. A gc free interface is added to support explicitly reclaiming object space, and the implementation of gc alloc is modiﬁed to try to reuse the explicitly reclaimed space immediately. VMCore side. Runtime support for mapping VM RT FREE instruction to gc free interface is implemented. The mapping needs to push the right parameters onto the runtime stack, and to ensure the consistency of the stack pointer before and after the call of gc free interface.

Just-in-Time Compiler Assisted Object Reclamation and Space Reuse

21

Fig. 2. Code fragment from BH, one of the Jolden benchmarks. The code in italics is inserted by the compiler.

Beyond the explicit reclamation and space reuse discussed in this paper, more techniques for compiler assisted GC can be explored on the framework. e.g., as to some allocation sites in loops we can let them produce objects at the ﬁrst iteration and reset such objects in subsequent iterations to reduce the allocation overhead of GC. 2.2

A Simple Example

Fig.2 shows a code fragment from BH of Jolden. The code underlined is inserted by the compiler. Lines 1-20 show a class MathVector containing a static ﬁeld NDIM and an array ﬁeld data. The constructor at lines 4-8 builds a double array object o1 and initializes each element of o1 . Lines 23-31 show a method subdivp in class Cell which creates an object o2 of type MathVector at line 24. Note that after line 26, dr is not live, thus o2 only referenced by dr is dead. If the compiler maintains the ﬁeld reference information across procedures, it can further check whether objects referenced by the ﬁelds of the identiﬁed dead objects are dead. In the example object o1 is only referenced by ﬁeld data of o2 in method subdivp, so once o2 is dead, the compiler can detect that o1 is also dead according to the ﬁeld reference information, and can decide that the dead points (i.e., program point where an object is dead) of o1 and o2 are the same. Although the compiler identiﬁes o1 and o2 are dead after line 26, there are still two problems to be considered. One is does the allocation site of an object dominates its dead point? The other is how to get the reference to the dead object? Here the allocation site of o2 dominates the point after line 26, so its free instruction can be inserted after line 26. However, there is no reference to o1 in original subdivp code, so the compiler has to generate instructions to obtain the reference of o1 , the load of dr.data in Fig.2 (line 27) will correspond to several instructions in HIR level.

22

3

Y. Zhang et al.

Object Lifetime Analysis and Program Transformation

In this section we ﬁrst give the compiler analysis and transformation for explicit object deallocation in a nutshell, then describe some key parts in it. 3.1

Analysis and Transformation in a Nutshell

When executing a Java application on DRLVM, VMCore controls the class loading, and interacts with the compiler to compile the bytecode of a to-be-executed method into native code, and then executes it. The compilation process of a Java method is deﬁned as a pipeline speciﬁed in the EM conﬁguration ﬁle, and our analysis and transformation is developed as an HIR optimization pass which can be conﬁgured into a pipeline. Due to the limitation of the pipeline management framework, the pass can only directly obtain the HIR of the current compiling method M , and have to insert free instructions into M only when it is the compiling method of the pipeline. 01 02 03 04 05 06 07

if (M is not analyzed){ G = init(M); L = calculateLiveInfo(M); B = getRevRuntimeBasicBlocks(M); foreach b in reverse iterator of B { foreach instruction i in b transGwithInst(i, G);

08 09 10 11 12 13 14

transGwithLiveInfo(b, L, D); genInstrumentInfo(D, I)

} addResult(M);

} M, I = getResult(M); transHIR(M, I);

Fig. 3. Flow of the object lifetime analysis and transformation

Fig.3 is the ﬂow of the pass, where method M is in the HIR of static single assignment (SSA) form; L, D, I, and M represent the liveness information, the dead object information, the free instrument information and the summary of M , respectively. Line 03 calculates L. Line 04 gets the reversed pseudo-runtime basic block sequence of M , which consists of all reachable basic blocks of M in reverse topological order, where the basic block exited from a loop is located before all basic blocks in the loop, and the exception handling block edged from a basic block b is located before other blocks edged from b. Lines 05-10 include operations on identifying D through the intra-procedural and inter-procedural analysis based on a program abstraction called points-to escape graph (PEG), and collecting I from D. Line 11 records the analyzing result of M , and line 14 transforms the HIR of M according to the analyzing result of M. In the following subsections we present the details of the PEG, the intraprocedural and the inter-procedural analysis, and the instrument information collection in turn. 3.2

Points-to Escape Graph

Deﬁnition 1. Suppose M is a method, and V , P denote the set of variables and the set of formal parameters of method M , respectively. The PEG of M is a directed graph, denoted as G = (No Nr , Ep Ef ) ( represents disjoint union) where:

Just-in-Time Compiler Assisted Object Reclamation and Space Reuse

23

– No = Nc Np represents the set of objects accessed in M . • Nc represents the set of objects created by allocation sites in M . • Np = Nf p ∪ Nin represents the set of objects created outside M , called phantom objects, where Nf p represents the set of objects created in the direct or indirect callers of M and passed into M via formal parameters of M and their fields, Nin represents the set of objects created in the direct or indirect callees of M and passed into M via the return value receivers and their fields or fields of the actual parameters at each call site in M . • Nret ⊆ No represents the set of objects returned from M . – Nr is the set of reference nodes in M . Each variable with reference type in M corresponds to a reference node, i.e., Nr ⊆ V . – Ep ⊆ Nr × No represents the set of points-to edges. v, o ∈ Ep denotes that reference node v may point to object o. – Ef ⊆ No × F × No represents the set of field edges where F represents the set of non-static fields in M . o1 , f, o2 ∈ Ef denotes that field f of object o1 may point to object o2 . Each object o in a PEG G (o ∈ No ) has an associated escape state, denoted as ξ(o). The range of ξ(o) is a lattice E consisting of two elements: EN ≺ EG . EG means the object escapes globally and may be accessed by multiple threads, EN means that the object may not escape globally. If an object o does not escape globally, that is, the object can be accessed only by a single thread, and no other variables or object ﬁelds refer to object o after a program point p, then o can be reckoned as a dead object at point p, we call p the dead point. 3.3

Intra-procedural Analysis

The identiﬁcation of dead objects are accompanied by building and transforming the PEG of M according to each instruction in the pseudo-runtime basic block sequence, and the live variable information, i.e., lines 07-08 in Fig.3. We ﬁrst discuss the analysis process neglecting call instructions in this subsection. Transforming the PEG According to the Basic Instructions. Given an instruction i in the HIR, the PEG at entry to i (denoted as G(i) ) and that at exit from i (denoted as G(i) ) are related by the standard data ﬂow equations: G(i) = f i (G(i) )

(1)

G(i) = ∧i ∈P red(i) G(i )

(2)

where f i denotes data ﬂow transfer function of instruction i, P red(i) is the set of predecessor instructions of i and operator ∧ is a merge of PEGs. Table 1 shows the transfer function f i for each kind of basic instructions i, where the Ac and Ap operations are deﬁned in Deﬁnitions 2 and 3.

24

Y. Zhang et al. Table 1. The transfer functions for each kind of basic instructions

G(i) = f i (G(i) ) o := newObject(); Nf p := Nf p ∪ {o}; Ep := Ep ∪ {f p, o}; ξ(o) := EG . new : v = new C o := newObject(); Nc := Nc ∪ {o}, Ep := Ep ∪ {v, o}; v = new C[] if (o is a thread object) ξ(o) := EG else ξ(o) := EN . copy: v1 = v2 Ep := Ep ∪ {v1 , o|v2 , o ∈ Ep }. phi: v=phi(v1 ,v2 ) Ep := Ep ∪ {v, o|v1 , o ∈ Ep ∨ v2 , o ∈ Ep }; putField : v1 .f = v2 suppose X = {x|v1 , x ∈ Ep }, Y = {y|v2 , y ∈ Ep } Ef := Ef ∪ {x, f, y|x ∈ X, y ∈ Y }; ∀x ∈ X, ∀y ∈ Y .Ac (x, y); if (ξ(y) = EG ) Ap (y). getField : v1 = v2 .f suppose X = {x|v2 , x ∈ Ep }, Y = {y|x, f, y ∈ Ef , x ∈ X} if (Y = ∅){ o := newObject(); Nin := Nin ∪ {o}; ξ(o) := EN ; Ef := Ef ∪ {x, f, o|x ∈ X}; Ep := Ep ∪ {v1 , o}} else{Ep := Ep ∪ {v1 , y|y ∈ Y }} putStaticField :C.sf=v ∀v, o ∈ Ep .ξ(o) := EG ; Ap (o). getStaticField : v=C.sf o := newObject(); Nin := Nin ∪ {o}; ξ(o) := EG ; Ep := Ep ∪ {v, o}. return: return v Nret := Nret ∪ {o|v, o ∈ Ep }. HIR instruction i defineArg: fp ∈ P

Deﬁnition 2. Given two object nodes o1 , o2 ∈ No in the PEG G, the escape state combination operation Ac (o1 , o2 ) which propagates ξ(o1 ) to ξ(o2 ) is defined as: e ∈ E e = ξ(o1 ) ξ(o2 ) ≺ e (3) ξ(o2 ) := e Deﬁnition 3. Given an object node o ∈ No in the PEG G where ξ(o) = EG , operation Ap (o) sets the escape state of each object reachable from object o via a path of field edges to be EG . The Ac operation is used when there is a ﬁeld assignment (i.e., putField or getField in Table 1) or inter-procedural information combination, while the Ap operation is used when there is a static ﬁeld assignment (i.e., putStaticField or getStaticField ). Transforming the PEG Combining with the Live Variable Information. After analyzing all instructions in a basic block, combining with the live variable information, if a variable v is not live, the out points-to edges of v will be clipped, thus objects only pointed to by v can be regarded as dead. Furthermore, if an object o dies, the out ﬁeld edges of o will be clipped, thus objects only referenced by the ﬁelds of o can also be regarded as dead. The clip operation AD is based on the live variable information and produces the dead object information at the end of each basic block. Deﬁnition 4. Given an object o ∈ No , a reference node v ∈ Nr , G and G denote the PEGs before and after the AD operation respectively. The AD operation is defined as the following two rules.

3.4

Just-in-Time Compiler Assisted Object Reclamation and Space Reuse

25

G = (No ∪ Nr , Ep ∪ Ef ) v ∈ Nr Epv = {v, o|o ∈ No } G = (No ∪ Nr , Ep ∪ Ef ) Nr = Nr − {v} Ep = Ep − Epv

(4)

G = (No ∪ Nr , Ep ∪ Ef ) o ∈ No Efo = {o, f, o |o ∈ No } G = (No ∪ Nr , Ep ∪ Ef ) No = No − {o} Ef = Ef − Efo

(5)

Inter-procedural Analysis

When analyzing a method M , only objects with EN state may be explicitly freed. If such an object is referenced by a formal parameter or the return value of M or reachable from their ﬁelds, the object cannot be freed in M because M ’s callers may use it. The object lifetime analysis needs to record them into a summary of M , and update the PEG of M ’s caller using M ’s summary when analyzing a call instruction to invoke M . Deﬁnition 5. Given a method M and its PEG G = (No ∪ Nr , Ep ∪ Ef ), the object lifetime analysis result of M is a 2-tuple M, D where: – M = (Nf p ∪Nret , Ef ) is a summary of M . It records all objects referenced by the formal parameters or the return value of M , i.e., Nf p ∪Nret , and the set of field edges starting from them, i.e., Ef = {o, f, o |o ∈ Nf p ∪Nret ∧o, f, o ∈ Ef . – D describes the dead object information in M . It is a set of triples, each triple is denoted as o, r, p, where • o ∈ No is dead after the point p in M , • r = v, f represents the reference to o. If f is null, then v is the reference to o, otherwise, v.f is the reference to o. Given a method M and its PEG G, suppose there is a call instruction v = m , E m . The v0 .m(v1 , ..., vn ) in M and the summary of m is Mm = Nfmp ∪ Nret f process of dealing with the call instruction is as follows: 1. Combine the formal parameters and the actual parameters. For each vi , o in Ep , perform Ac (f pi , o) where f pi ∈ Nfmp is the corresponding formal parameter of m. 2. Combine the return value and the return value receiver. For each v, o in m is the return value of m. Ep , perform Ac (r, o) where r ∈ Nret 3. Map ﬁeld edges. For each edge in Efm , add a corresponding edge in Ef . 4. Propagate escape states. If the escape state of an object o referenced by one of the actual parameters or the return value receiver becomes EG , then perform Ap (o). The above inter-procedural combination takes a callee summary as precondition. However, a callee of M may not be analyzed when analyzing M . If so, there are two optional ways to deal with the call site. One is to neglect the callee and to make a decision conservatively. The other is to start up a new pipeline to compile the unanalyzed callee to obtain its summary. This special pipeline only includes a few basic passes translating bytecode into non-optimized HIR of SSA

26

Y. Zhang et al.

form and the object lifetime analysis pass in order to obtain the analyzed result of the callee and not the native code of the callee. The latter way is more precise but consumes more overhead. We introduce an argument to control the depth level starting up the special compilation of unanalyzed callees, thus users can use it to trade oﬀ between precision and performance. 3.5

Collecting Free Instrument Information

The dead objects and their dead points in D cannot be directly used as the instrument information generating free instructions. Sometimes instrumenting directly at a dead object o’s dead point may bring compile-time or runtime errors. e.g., if o is created in a branch and dies outside the branch, freeing o at the dead point might induce a runtime error. Another problem is how to free an object o that dies in method M and has no explicit reference to o in M , e.g., o1 referenced by dr.data in method subdivp. Therefore, we need to collect instrument information from the dead object information. The structure of the instrument information I is quite similar to that of dead object information, the only diﬀerence is that the dead point in the latter is changed into the instrument point in the formal. When collecting instrument information, a rule must be followed: given a dead object o and its dead point p, the basic block in which the allocation site of o appears must dominate the dead point p of o. Otherwise, it means the compiler is trying to free an object that may not be allocated before, thus causing runtime errors. The dominance relationship can be obtained from the dominance tree of HIR. Two key steps of the collecting process are as follows: Conﬁrm dead object reference: for an object o, there are two kinds of references: one is at its allocation site, e.g., A a = new A( ), where a is the reference to the newly created object o here; the other is brought by phi, assign, getF ield, or putF ield instructions. We preferentially choose the reference at allocation site for a dead object. If the dead object has no explicit reference in the method, we can make use of other object’s ﬁeld to generate its reference indirectly. Conﬁrm instrument point: for a dead object o, we preferentially choose the basic block which contains a return instruction (denoted return node) to insert instructions to free o. If there is not any exception when executing the program, the return node must be executed. If the conﬁrmed reference point of dead object o cannot dominate return node, and the dead point of o is in a branch, then we have to insert instructions to free o at the dead point because such a branch may not be executed. According to the reference and the instrument point information of each dead object provided in I, the code transformation of free instrumentation can easily create instructions to explicitly free object. Fig.4 shows the analysis process of method subdivp in Fig.1. We give sourcelevel statements for the sake of brevity. The new expression in block 2 will implicitly invoke the constructor, so object o1 created there will be passed into subdivp and become an element of Nin in the PEG of subdivp. At the end of

Just-in-Time Compiler Assisted Object Reclamation and Space Reuse

27

Fig. 4. An example illustrating the analysis process

block 4, variable dr is not live, so the analysis determines that o2 only referenced by dr and o1 only referenced by dr.data are both dead. At last the analysis will record the analysis result. 3.6

Special Design Tradeoﬀ

Thread Object. Thread objects are distinguished from other ordinary objects based on class hierarchy diagrams, and their escape states can be initialized as EG . If an object o is assigned to a thread object ﬁeld, then ξ(o) := EG . If a thread object ot has ended its execution, ot will be treated as an ordinary object and ξ(ot ) will be reset according to the escape states of objects referring to ot in the current PEG. However, it is diﬃcult to identify when and where a thread has ended execution. Our analysis only judges this case by join() invocations of thread objects. Loop. Instructions in loops are analyzed only once, which makes the analysis simpler and cheaper since the analysis overhead is a part of the whole program runtime overhead. The analysis is also correct and conservative because according to the rules in section 3.5, 1)assuming the allocation site pa of an object o occurs before a loop entry and o dies in the loop, if pa dominates a return node pr , then select pr as the instrument point, else might select some point after the loop exit; 2) assuming the allocation site pa of o occurs in a loop, if o dies in the loop, then select the dead point in the loop which can be dominated by pa as the instrument point, otherwise indicating any reference to o is live at all basic blocks of the loop, and not freeing o.

28

Y. Zhang et al.

Array. All elements of an array are abstracted as an object with a special ﬁeld, and accesses to an element are treated as accesses to the special ﬁeld. It may reduce the size of explicitly freed objects but save analysis overhead. Recursion. Our inter-procedural analysis can handle recursion. It maintains a chain of method invocation when meeting an unanalyzed callee, if the current unanalyzed callee has already existed in the chain (i.e., there is a recursion), the loop in the chain is cut and the inter-procedural combination could be done conservatively based on the current method summaries.

4

Explicit Reclamation and Space Reuse

GCv5 [19] is a parallel GC which support multiple collector s running collaboratively. We choose GCv5-MS to implement explicit reclamation and space reuse (denoted as JIT-GCv5-MS), because it uses free-list to organize heap space and is convenient to add or acquire a free space from the heap. 4.1

Brief Overview of GCv5-MS

Each thread in an application (called application thread ) corresponds to a mutator thread in GCv5-MS. Each mutator takes charge of the allocation of the corresponding application thread. GCv5-MS classiﬁes objects into two kinds, i.e., small objects (less than 1KB) and large objects (greater than or equal to 1KB), and provides Free Block Pool (FBP) and Free Area Pool (FAP) shown in Fig.5 for the allocation of the two kinds, respectively. Each pool is organized as an array of segregated free lists, where each free list contains blocks/areas of the same size or class size. The FBP has 254 segregated free block lists shared among all mutator s, and block s in the same list provide objects of the same size (from 8B to 1020B, aligned in 4B). Each block comprises a header and a data area. The header depicts information on the data area, e.g., a bitmap marking the status of each slot in the data area, such as in use or free. Each mutator requests a free block from the pool and its acquired blocks are local to the mutator. When a mutator receives a request of allocating a small object, it searches its local block of the requested size. If there is a free slot of the requested size then the mutator can return one; otherwise it need to request a free block of the requested size from the pool. Operations on the pool must be synchronized while operations on the mutator-local blocks need not. The FAP has 128 segregated free area lists. The last list contains free areas of the size greater than or equal to 128KB. All mutator s share the pool and must request memory for large objects with synchronization. Generally speaking, there are relatively few large objects in applications, so the synchronization overhead of parallel large object allocations is not high. 4.2

Allocation and Explicit Reclamation in JIT-GCv5-MS

In order to support explicit reclamation and space reuse, we modify GCv5-MS as JIT-GCv5-MS to add gc free and to modify the implementation of gc alloc.

Just-in-Time Compiler Assisted Object Reclamation and Space Reuse

Fig. 5. Heap space management of GCv5-MS

29

Fig. 6. Heap space management of the explicitly reclaimed objects

Each mutator in JIT-GCv5-MS handles not only allocation requests but also explicit free ones from its corresponding application thread. Due to the diﬀerent memory management mechanisms between the small and the large objects in GCv5-MS, we take diﬀerent methods. Handling with Small Objects. If a mutator mf receives a request to free a small object o, slot occupied by o must belong to a mutator-local block of some mutator ma , where mf may not be ma , that is, o may not be thread-local. If mf directly modiﬁes the mark bits of o in the block header as free status to reclaim the slot, and lets the original allocation algorithm control the reuse, accesses to the word containing the mark bits by mf need to be synchronized, because the word contains other slots’ mark bits, which may be accessed by ma to handle an allocation request or by other mutator to handle another explicit free request, simultaneously. Thus allocation operations on the mutator-local blocks which need not be synchronized originally, have to be synchronized, which brings more synchronization overhead. In order to avoid such synchronization, we introduce a Reclaimed Object Pool (ROP) (shown in Fig.6) for each mutator to collect its explicitly reclaimed object spaces. When mutator mf reclaims an object o, it does not modify the mark bits of o, but forces the object slot into a node of type Reclaimed Object Entry and inserts the node into a list of the same size in mf ’s local ROP. gc alloc need be modiﬁed to try to reuse the explicitly reclaimed object space immediately. That is, it ﬁrst searches its local ROP for free space of the requested size. If there are none, it continues to allocate as the original strategy in GCv5-MS. Handling with Large Objects. Because all mutator s share the FAP for allocating large objects and the synchronization on these operations cannot be neglected, we keep the implementation on allocating large objects as original. When mutator mf receives a request to free a large object o, it directly insert the memory area occupied by o into the free list of the matched size in the FAP, thus the subsequent object allocation of the same size will reuse the memory area. It is noticed that explicit reclamation of large objects need be synchronized, and we cannot easily obtain the reuse rate of large reclaimed object space.

30

Y. Zhang et al.

5

Experimental Results

We have implemented the above work in DRLVM and evaluated it with Jolden and SPECjbb2005. The experiments were performed on 2.1GHz AMD Athlon dual core machine with 896MB of memory running Windows XP. 5.1

Eﬀectiveness of the JIT-Assisted GC

First, we check whether the JIT-assisted GC frees still reachable objects or frees dead objects at wrong program points. In order to perform the correctness validation, we modify the implementation of gc free, mark the explicitly reclaimed object as un-useable and un-reusable. In addition, the pipeline performs many checks in LIR, such as variable liveness checking. In this way if done a wrong free action, the system will throw exception at the next access to a freed object or at the access to a potential undeﬁned variable. The experiments show that there are no such exceptions and errors in compile time or runtime. Table 2 presents the statistics on allocation, free and reuse for our JIT-assisted GC system at the default heap size 256MB of the VM. The ﬁrst four programs are from Jolden, the JIT-assisted GC explicitly frees 66% of all objects on average and up to 96% in Jolden. We ﬁnd that the free instructions inserted in loops or recursive methods can bring considerable income, and these explicitly reclaimed object spaces can be reused easily because the same allocation site will be executed many times, e.g., Health reclaims 14MB and almost all the space is from such free instructions. Table 2. Memory freed and reused by JIT-assisted GC. Application

BH Health Power TSP SPECjbb2005

Total Alloc Free Mem Free Mem Total Mem on free(x) on free(x.f) Free Mem. 67MB 14MB 46MB 60MB 60MB 14MB 0B 14MB 24MB 23MB 100B 23MB 51MB 28MB 88B 28MB 1419MB 104MB 0B 104MB

Total Reuse Mem 60MB 14MB 23MB 27MB 104MB

%Free Mem

%Reuse Mem

90% 23% 96% 55% 7%

100% 100% 100% 96% 100%

The last column of the table shows the explicitly reclaimed memory reuse ratio. For the programs in Jolden, the ratios are high and the explicitly reclaimed objects are all small objects, this illustrates that the JIT-GCv5-MS can reuse almost all these small objects. The reuse ratio of SPECjbb2005 is relatively low because the system reclaims many large objects, and the system does not count statistics for the large object space reuse due to not increasing the synchronization cost. 5.2

Time Cost

Table 3 presents the statistics on the time cost of our object lifetime analysis and transformation pass and the total compilation time, we can see that the pass cost less than 10% of the total compilation time.

Just-in-Time Compiler Assisted Object Reclamation and Space Reuse

31

Table 3. Analysis time and total compilation time. Application BH Health Power TSP SPECjbb2005

5.3

Objlife Time Total Comp. Time %Objlife 23ms 537ms 4.3% 14ms 309ms 4.5% 12ms 332ms 3.6% 11ms 207ms 5.3% 738ms 19011ms 3.9%

Performance Improving of JIT-Assisted GC

To evaluate the performance impact, we compared the GC execution times of the benchmark programs. Fig.7 presents the GC execution time comparison of programs in Jolden. The x-axis is the heap size and the y-axis is the GC execution time. We can see that GC execution time of JIT-GCv5-MS is less than that of GCv5-MS. Along with the increase of the heap size, the performance improvement becomes small. This is because the numbers of the explicitly reclaimed objects and those of the reused objects are ﬁxed, the larger heap size relatively decreases the performance improvement, as Health and TSP in Fig. 7. As to BH and Power, even if the heap size is set to the least 16MB, the execution time of GC in JIT-GCv5-MS is zero, since more than 90% of the allocated space can be explicitly reclaimed.

Fig. 7. Performance comparison of 4 programs in Jolden

Fig.8 shows the throughput comparison of SPECjbb2005 with and without JIT-assisted GC optimization. It lists the collectively throughputs of 6 group experiments. The dark column and the ﬁrst row in the data table illustrate the throughput without JIT-assisted GC. The tint column and the second row in the

32

Y. Zhang et al.

Fig. 8. The throughput comparison of SPECJbb2005

data table illustrate the throughput with JIT-assisted GC. The third row in the table is the improving ratio. We can see the improving ratio is about 1.3∼2.9%.

6

Related Work and Conclusions

Guyer et al. propose a free-me analysis [12] which is closest to our work. They combine a light-weight pointer analysis with liveness information that detects when short-lived objects die, and insert calls to free dead objects. However, their method cannot identify the lifetime of objects referenced by ﬁelds due to its ﬁeld-insensitive property. Cherem et al. present a uniqueness inference and can free objects with unique reference in the whole heap through free instructions and destructors [13, 14], the work needs to modify libraries to add destructors, this method is complex and diﬃcult and not ﬁt for the system built in virtual machine because the latter need to exucute applications accompanied by justin-time compilation. Both of the works do not support the reuse of the explicitly reclaimed space. Lee and Yi’s analysis inserts free instructions only for immediate reuse, i.e., before an allocation of the same size [15]. Marinov et al. present Object Equality Proﬁling (OEP) [16] to discover opportunities for replacing a set of equivalent object instances with a single representative object. Gheorghioiu et al. present an inter-procedural and compositional algorithm for ﬁnding pairs of compatible allocation sites [17], which have the property that no object allocated at one site is live at the same time as any object allocated at the other site. All these works focus on object merging and reuse only for the same size objects with lifetime homogeneity only on the compiler end. Our work can identify some short-lived objects not limited in method scope or other special features like [15], it also detects objects only referenced by the ﬁelds of the identiﬁed dead objects. The PEG based analysis seems similar to [7], however, the deﬁnition of the escape lattice and the rules on building and transforming the PEG are very diﬀerent. In addition, our work not only frees the identiﬁed dead objects, but also tries to reuse them immediately. Although our current work in GCv5 is on Mark-sweep algorithm, we can easily extend the work to other algorithms in GCv5.

Just-in-Time Compiler Assisted Object Reclamation and Space Reuse

33

Based on the JIT-assisted GC framework, we can explore more optimization on memory management. We are analyzing the beneﬁt of each free instruction inserted by JIT and the memory utility of each allocation site in loop or recursive method by developing a log system with the cooperation among JIT, VMCore and GC. According to the analysis results, we will ﬁnd more chances on memory management optimization.

References 1. Jones, R., Lins, R.: Garbage collection: algorithms for automatic dynamic memory management. John Wiley & Sons, Chichester (1996) 2. Specjbb2005 benchmark (2005), http://www.spec.org/jbb2005/ 3. Kero, M., Nordlander, J., Lundgren, P.: A correct and useful incremental copying garbage collector. In: Proc. 6th Int’l Symp. on Memory Management, October 2007, pp. 129–140. ACM Press, New York (2007) 4. Blackburn, S., McKinley, K.: Immix garbage collection: mutator locality, fast collection, and space eﬃciency. In: Proc. 2008 ACM Conf. on Prog. Lang. Design and Impl., pp. 22–32. ACM Press, New York (June 2008) 5. Whaley, J., Rinard, M.: Compositional pointer and escape analysis for java programs. ACM SIGPLAN Notices 34(10), 187–206 (1999) 6. Gay, D., Steensgaard, B.: Fast escape analysis and stack allocation for object-based programs. In: Watt, D.A. (ed.) CC 2000. LNCS, vol. 1781, pp. 82–93. Springer, Heidelberg (2000) 7. Choi, J.D., Gupta, M., Serrano, M.J., Sreedhar, V.C., Midkiﬀ, S.P.: Stack allocation and synchronization optimizations for java using escape analysis. ACM Trans. on Programming Languages and Systems 25(6), 876–910 (2003) 8. Gay, D.E., Aiken, A.: Language support for regions. In: Proc. 2001 ACM Conf. on Prog. Lang. Design and Impl., June 2001, pp. 70–80. ACM Press, New York (June 2001) 9. Grossman, D., Morrisett, G., Jim, T., Hicks, M., Wang, Y., Cheney, J.: Regionbased memory management in cyclone. In: Proc. 2002 ACM Conf. on Prog. Lang. Design and Impl., June 2002, pp. 282–293. ACM Press, New York (June 2002) 10. Salagnac, G., Yovine, S., Garbervetsky, D.: Fast escape analysis for region-based memory management. In: Proc. 1st Int’l Workshop on Abstract Interpretation for Object-Oriented Languages, January 2005. ENTCS, vol. 141, pp. 99–110. Elseiver, Amsterdam (January 2005) 11. Stefan, A., Craciun, F., Chin, W.N.: A ﬂow-sensitive region inference for cli. In: Ramalingam, G. (ed.) APLAS 2008. LNCS, vol. 5356, pp. 19–35. Springer, Heidelberg (2008) 12. Guyer, S.Z., McKinley, K.S., Frampton, D.: Free-me: a static analysis for automatic individual object reclamation. In: Proc. 2006 ACM Conf. on Prog. Lang. Design and Impl., June 2006, pp. 364–375. ACM Press, New York (June 2006) 13. Cherem, S., Rugina, R.: Compile-time deallocation of individual objects. In: Proc. 5th Int’l Symp. on Memory Management, June 2006, pp. 138–149. ACM Press, New York (June 2006) 14. Cherem, S., Rugina, R.: Uniqueness inference for compile-time object deallocation. In: Proc. 6th Int’l Symp. on Memory Management, October 2007, pp. 117–128. ACM Press, New York (October 2007)

34

Y. Zhang et al.

15. Lee, O., Yi, K.: newblock Experiments on the eﬀectiveness of an automatic insertion of memory reuses into ml-like programs, October 2004, pp. 97–108. ACM Press, New York (October 2004) 16. Marinov, D., O’Callahan, R.: Object equality proﬁling. In: Proc. 18th ACM SIGPLAN Conf. on Object-Oriented Prog. Systems, Lang., and Applications, October 2003, pp. 313–325. ACM Press, New York (October 2003) 17. Ovidiu Gheorghioiu, A.S., Rinard, M.: Interprocedural compatibility analysis for static object preallocation. In: Proc. 30th ACM Symp. on Principles of Prog. Lang., January 2003, pp. 273–284. ACM Press, New York (January 2003) 18. Apache harmony drlvm (2006), http://harmony.apache.org/subcomponents/drlvm/index.html 19. Apache harmony gcv5 (2008), http://harmony.apache.org/subcomponents/drlvm/gc-v5.html

Optimization of Triangular Matrix Functions in BLAS Library on Loongson2F Yun Xu1,2 , Mingzhi Shao1,2 , and Da Teng1,2 1

School of Computer Science and Technology, University of Science and Technology of China, Hefei, China [email protected], [email protected], [email protected] 2 Anhui Province Key Laboratory of High Performance Computing, Hefei, China

Abstract. BLAS (Basic Linear Algebra Subprograms) plays a very important role in scientiﬁc computing and engineering applications. ATLAS is often recommended as a way to generate an optimized BLAS library. Based on ATLAS, this paper optimizes the algorithms of triangular matrix functions on 750 MHZ Loongson 2F processor-speciﬁc architecture. Using loop unrolling, instruction scheduling and data pre-fetching techniques, computing time and memory access delay are both reduced, and thus the performance of functions is improved. Experimental results indicate that these optimization techniques can eﬀectively reduce the running time of functions. After optimization, double-precision type function of TRSM has the speed of 1300Mﬂops, while single-precision type function has the speed of 1800Mﬂops. Compared with ATLAS, the performance of function TRSM is improved by 50% to 60%, even by 100% to 200% under small-scale input. Keywords: BLAS; ATLAS; triangular matrix function; loop unrolling; data pre-fetching.

1

Introduction

In the contemporary scientiﬁc engineering, most of the running time is spent on basic linear algebra functions. A lot of software related to matrix computing invokes functions in BLAS [1] (Basic Linear Algebra Subprograms). As a consequence, it is imperative to optimize the BLAS libraries based on a speciﬁc machine to fully utilize its hardware resource. KD-50-I is a high performance computer that employs China’s Loongson 2F superscalar CPU, which has the advantage of low power, low cost and high integration. Our benchmark for high performance computer KD-50-I is HPL (High Performance Linpack), which is implemented by invoking functions in the BLAS library. Therefore, the eﬃciency of functions in the BLAS library can directly aﬀect the performance of the KD50-I system. The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 35–45, 2010. c IFIP International Federation for Information Processing 2010

36

Y. Xu, M. Shao, and D. Teng

Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrixmatrix operations. In this paper we mainly aim at the optimization of functions that computes triangular-matrix and vector in level 2, and the TRMM and TRSM function that implement triangular-matrix and matrix operations in level 3. These functions take up about one third of the BLAS library. On optimization of linear algebra library, contemporary research focuses on an algorithmic level [2,3]. In a period of time that CPU reads one byte from memory, it can execute hundreds of instructions. Consequently, the bottleneck of optimizing functions is not computing time but memory access delay. In order to reduce memory access delay, RA Chowdhury [4] proposed a method that extends the cache-oblivious framework to solve The Gaussian Elimination Paradigm (GEP); Tze Meng Low [5] provided with a high-eﬃciency blocking algorithm for functions in level 3. ATLAS [6, 7] (Automatically Tuned Linear Algebra Software) is one of the matrix packages [8,9]. ATLAS is portable BLAS software which ﬁrstly tests hardware parameters and then optimizes some dense-matrix functions using basic optimizing techniques. ATLAS can optimize basic BLAS functions automatically upon the parameters of cache capacity and memory access delay that ATLAS has tested. However, there still exist unknown parameters of speciﬁc architectures, e.g. pipeline structure. Thus, there is room for optimization of codes that ATLAS generates. Based on ATLAS, we further optimize triangular-matrix functions in BLAS from an algorithmic level to reduce access delay and to improve the performance of BLAS, using general optimizing techniques (such as matrix blocking, loop unrolling) and optimizing techniques speciﬁc on Loongson 2F (such as multiplyadd instruction, instruction scheduling, data pre-fetching).

2

Triangular-Matrix Functions and ATLAS Methods

There are 8 triangular-matrix functions in BLAS, however, here we only take the TRSM function to illustrate optimizing methods and results. 2.1

TRSM and Symbols

TRSM implements multiplication of inverse of triangular-matrix A and matrix B, as formula(1) illustrates, B ← αop(A)B

or

B ← αBop(A−1 )

(1)

where α(ALPHA, a scalar) is an extension factor, B is an M -by-N matrix, A is an upper (or lower) triangular (or unitriangular) matrix, and op(A) can be A, the transpose of A, or the conjugate transpose of A. If op(A) is on the left of B (left multiplier), A is M -by-M ;if not, A is N -by-N . The declaration of TRSM function is xTRSM (ORDET, SIDE, UPLO, TRANS, DIAG, M, N, ALPHA, A, LDA, B, LDB), where x represents s, d, c or z which

Optimization of Triangular Matrix Functions in BLAS Library

37

respectively stands for single precision ﬂoat data type, double precision ﬂoat data type, single precision complex data type, and double precision complex data type. 2.2

ATLAS Methods

In terms of TRSM, ATLAS employs the solution method for linear equations. ATLAS has made some basic optimizations toward TRSM as follows: Matrix Blocking. ATLAS optimizes TRSM by blocking, setting block size of real numbers as 4 and block size of complex numbers as 8, the same as coeﬃcients of loop unrolling. After matrix blocking, triangular matrix A is divided into several smaller rectangular matrices and triangular matrices, where smaller rectangular matrices can be solved by invoking GEMM function and smaller triangular matrices can be solved directly. Copy and Partial Matrix Transpose. Through copy and partial matrix transpose technique, ATLAS transfers a matrix into a transposed or non-transposed status, which changes data storage order and further improves the performance of functions. Loop Unrolling. ATLAS has devised the trsmKL and trsmKR function that operate on real numbers, and the CtrsmK function that operates on complex numbers, where trsmKL and trsmKR unroll a loop by 8 × 1 × 1 and CtrsmK unrolls all of the two inner loops.

3 3.1

General Optimizing Techniques Adjusting Block Size

Matrix blocking is a widely applied optimizing technique to enhance storage availability. It reduces local data sets to avoid conﬂicts. The matrix blocking algorithm is to partition the sub-data blocks of a matrix, in order to reuse data that are in the cache. We adjust the size of the blocks to a proper value so that each data set could be stored in a buﬀer, which reduces extra cost of blocking and ensures relatively low conﬂicts. The left multiplication format of dTRSM exempliﬁes the speciﬁc steps of matrix blocking. As Fig. 1 illustrates, triangular matrix AM×M is spitted into several smaller triangular matrices ARB×RB (grey region in Matrix A in Fig. 1) and several smaller rectangular matrices ARB×M (white region in Matrix A in Fig. 1, where M is a variable and RB ≤ M ≤ M ), so each A can be fully stored in L1 cache. As for ARB×RB , BM×N is split into several smaller rectangular matrices BRB×N , the counterparts of partitioned ARB×RB is solved by TRSM, and the counterparts of partitioned ARB×RB is solved by GEMM.

38

Y. Xu, M. Shao, and D. Teng

Fig. 1. Using matrix block technique to the triangular matrix A and the matrix B

As a consequence, optimization is concentrated in the multiplication operation of triangular matrix ARB×RB and BRB×N . It is clear that both A and B can be read from cache instead of memory, which avoids conﬂict and enhances the performance of TRSM. 3.2

Setting Coeﬃcients of Loop Unrolling

Loop unrolling is a common optimizing compilation technique, employed to reduce both cyclic variable operations and branch instructions. Besides, unrolling the outer loop of multi-loop can make certain data reusable. We can put these data into registers so that these data can be read directly from registers instead of memory, which lowers the requirements of communication bandwidth between register and cache as well as that between cache and memory. Next, we analyze how varied coeﬃcients aﬀect performance of functions, and then speciﬁc coeﬃcient will be chosen upon Loongson 2F architecture. We deﬁne the function ATL dreftrsmLUNN(ref for short) that implements the upper triangular form, non-transposed form, and non-unimatrix form of partitioned dTRSM. As Algorithm 1 illustrates, this function shows how coeﬃcients aﬀect function performance. There are 3 layers in the loop of algorithm 1, which are denoted by R, S, T. The ref function is attributable to the speed-up of memory access of TRSM under limitation. The times of memory access is (M 2 + M d)N when multiplication operation of M × M triangular matrix A and M × N rectangular matrix B is implemented. If layer T is unrolled for α times, the times of memory access of A and B is respectively N M (M + 1)/2 and M N (M + 1)/2α ; if layer S is unrolled forβ times, the times for memory access of A and B is respectively M N (M + 1)/2β and M N (M + 1)/2 . So the sum of memory access for unrolling layer T for α times and for unrolling layer S for β times is M N (M + 1)/2a + M N (M + 1)/2β. The computing complexity of dTRSM is (M 2 + 2M )N , and memory access speed is denoted by L(Mb/s), then the theoretical upper limit for the computing speed is:

Optimization of Triangular Matrix Functions in BLAS Library

39

Algorithm 1. Algorithm of ATL dreftrsmLUNN 1: S

for(j=0;j
2: T

for(i=M-1;i>=0;i–) { rC=A[i][i]∗B[k][j]; for(k=i+1;k<M;k++) { rC-=A[i][k]∗B[k][j]; } B[i][j]=rC/A[i][i];

3: R

}

}

speed = =

computing complexity ×L memory access

(M 2 + 2M )N L 2αβ(M + 2)L = M N (M + 1)(α + β)/(2αβ) (α + β)(M + 1) ≈

(2)

2αβL α+β

When we are deal with large-scaled data, and eﬃciency of ATL dreftrsmLUNN is limited by memory access, increasing the value of α and β could elevate this upper limit of computing speed. 3.3

Other Optimizing Techniques

Except for transforming division to multiplication operation for ref, we have done the same to complex data type functions, because a multiplication operation only takes one cycle while a division operation takes tens of cycles. Taking the left multiplication form of zTRSM for example, computing each column of matrix B that stores complex numbers requires 2M times division operations, so the total number of operations is 2M N . An array is necessary to place the elements in the diagonal of the triangular matrix, whose expression is r/(r2 + i2 ) and i/(r2 + i2 ) where r and i respectively stand for the real part and the imaginary part. It is not diﬃcult to replace the former division operations by multiplication, owing to which there is totally 2M division operations and 2M N multiplication operations-executing time is shortened. What’s more, after unwinding the loops in zTRSM and cTRSM, we also unroll Mmls multiply-subtract instruction for complex numbers used as core of computation to separate real part and imaginary part, so that it is easier for further optimization with speciﬁc techniques on Loongson 2F architecture.

40

4

Y. Xu, M. Shao, and D. Teng

Optimization Techniques Based on Loongson 2F Architecture

Loongson is a family of general-purpose MIPS-compatible CPUs developed at the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS) in China . KD-50-I is a supercomputer with a total of more than 330 Loongson-2F CPUs. Loongson’s instruction set is 64bit MIPS III compatible. It has separate 64/64 KB instruction and data L1 caches. According to literature[10], RB is set as 60 when we block the matrices in dTRSM function in section 3.1. There are 32 registers in Loongson 2F, so theoretically the number of times of loop unrolling for ATL dreftrsmLUNN can be set as 4 × 4 × 2, that is α = β = 4, to acquire a good result speed = 4L. In the following sections, there are other special techniques based on Loongson 2F architecture. 4.1

Multiply-Add Instruction

Because the traditional multiply and add instruction is RAW (read-after-write), it is often necessary to put other instructions between them to reduce pipeline idling. However, Loongson 2F is compatible with a speciﬁc multiply-add (or multiply-substract) instruction, using which we can combine multiplication with add operations to improve our program. 4.2

Instruction Scheduling

Another technique is to adjust the sequence of instructions to avoid pipeline idling. Loongson Pentium Pro Architecture has 5 execution units: 2 ALUs, 2 FPUs, and 1 address generation unit (AGU). Consequently, in each execution, one load instruction has to be followed by two ﬂoating point instructions and one ﬁxed-point math instruction, so that 4-way superscalar is strictly ﬁtted, which is contributable to acceleration of IPC (Instructions Per Clock). 4.3

Data Pre-fetching

The prefetching instruction of Loongson instruction set can cause the reorder queue blocked, so we employ a branch prediction technique as data pre-fetching technique. After unrolling loops in algorithm 1(in Fig. 2)by 4× 4 × 2, we use data pre-fetching and instruction scheduling techniques to make instructions in each row ﬁt for 4-way superscalar. The codes of the most inner loop are as follows: In Algorithm 2, incA and incB respectively stands for the distance of two continuous elements in matrix A and B. The load instruction in the ﬁrst part has acquired data required by the second part, and the data acquired in the second part is necessary in the ﬁrst part of the next iteration. It is obvious that except rA0, rB0, rA1, rB1, rA2, rB2, rA3 and rB3, each instruction can

Optimization of Triangular Matrix Functions in BLAS Library

41

Algorithm 2. Using data pre-fetching in loop unrolling 1:

2: {

rA0=∗ pA0; rB0=∗ pB0; rA1=∗ pA1; rB1=∗ pB1; rA2=∗ pA2; rB2=∗ pB2; rA3=∗ pA3; rB3=∗ pB3; for(k=i+1; k < M-1; k+=2) ra0=pA0[incA]; rC00-=rA0∗ rB0; rC10-=rA1∗ rB0; rb0=pB0[incB]; rC01-=rA0∗ rB1; rC11-=rA1∗ rB1; ra1=pA1[incA]; rC20-=rA2∗ rB0; rC21-=rA2∗ rB1; rb1=pB1[incB]; rC02-=rA0∗ rB2; rC12-=rA1∗ rB2; ra2=pA2[incA]; rC22-=rA2∗ rB2; rC32-=rA3∗ rB2; rb2=pB2[incB]; rC30-=rA3∗ rB0; rC31-=rA3∗ rB1; ra3=pA3[incA]; rC03-=rA0∗ rB3; rC13-=rA1∗ rB3; rb3=pB3[incB]; rC23-=rA2∗ rB3; rC33-=rA3∗ rB3; rA0=∗ pA0; rC00 -= ra0∗ rb0; rC10 -= ra0∗ rb1; rB0=∗ pB0; rC20 -= ra0∗ rb2; rC30 -= ra0∗ rb3; rA1=∗ pA1; rC01 -= ra1∗ rb0; rC11 -= ra1∗ rb1; rB1=∗ pB1; rC21 -= ra1∗ rb2; rC31 -= ra1∗ rb3; rA2=∗ pA2; rC02 -= ra2∗ rb0; rC12 -= ra2∗ rb1; rB2=∗ pB2; rC22 -= ra2∗ rb2; rC32 -= ra2∗ rb3;

pA0+= incA2; pB0+= incB2; pA1+= incA2; pB1+= incB2; pA2+= incA2; pB2+= incB2; pA3+= incA2; pB3+= incB2;

}

attain the data needed at least one cycle ahead, that is, data is pre-fetched. By renaming registers and using registers in turns, dependency among instructions can be decreased. For example, as algorithm 2 indicates, we employ rA0, rA1, rA2, rA3 and ra0, ra1, ra2, ra3 by rotation. The following two lines states that rA0 and ra0 fetch data in turns and they are independent of each other, so that data is successfully pre-fetched. ra0 = pA0[incA]; rC00− = rA0 ∗ rB0; pA0+ = incA2; ra0 ∗ rb0.

5

rA0 = ∗pA0; rC00− =

Experimental Results and Discussion

Our experiment is implemented through the combination of repeatedly testing using one case and testing in circles using a group of cases. Speciﬁcally, every function is executed repeatedly under various data scale, until the average speed is calculated as the ﬁnal result. Here we use Mﬂops(million ﬂoating-point operations per second) as the technical criteria. For convenience, parameters ”M”, ”N”, ”LDA” and ”LDB” of function TRSM is replaced by the same value ”length”. Conﬁgurations of compilers include set open multiply-substract instructions (-DAdd DStringSunStyle), Linux operating system(-DATL OS Linux), not saving frame pointer at function call (-forit-frame-pointer), optimization level (-O3), and unrolling all the loops(-funroll-all-loops).

42

5.1

Y. Xu, M. Shao, and D. Teng

Optimization Results of dTRSM

The outcome of optimization on dTRSM will be discussed from two aspects: inﬂuences placed on dTRSM by each techniques under small data scale and comprehensive eﬀect by all the techniques under large data scale. Because implementation of functions is independently done by TRSM under small data scale, the eﬀect of optimization is reﬂected directly by performance of TRSM. Fig. 2 shows the performance inﬂuenced by each technique under small data scale.

Fig. 2. Comparison of diﬀerent optimization methods in terms of dTRSM function performance

Fig. 3. Performance of optimized function dTRSM

Optimization of Triangular Matrix Functions in BLAS Library

43

Fig. 4. Performance of optimized functions xTRSM

In Fig. 2, Test1 stands for invoking dTRSM directly; Test2 stands for using loop unrolling technique; Test3 for loop unrolling and data pre-fetching; Test4 for loop unrolling, data pre-fetching, and division replced by multiplication. Test2 outperforms Test1 by almost 3 times, which is close to the theoretical value; from Test2 and Test3, it is clear that rate is doubled by instruction scheduling technique and data pre-fetching technique on the basis of loop unrolling; from Test3 and Test4, it is not hard to observe that executing time is shortened by replacing division with multiplication instruction. Thus, loop unrolling, dataprefetching, instruction scheduling, and replacing division with multiplication techniques can all accelerate the speed of execution. After Test4 optimization, the speed of dTRSM can be 1723.45Mﬂops when the threshold of blocking size is 60-it outperforms the algorithm 1 by a factor of 7.37 and outperforms ATLAS by a factor of 1.5. When dealing with large-scaled data, the computation of TRSM takes up a rather low ratio. Thus, we only discuss a comprehensive result of optimizationopt in Fig.3 represents the result. The speed of dTRSM in the steady status is 1300Mﬂops, which outperforms the former function by 60%. 5.2

Performance of Other Optimized Functions in TRSM

In terms of double precision complex type function zTRSM, we acquire the proper optimizing techniques through testing how those techniques works under small scaled data. The function does not achieve a high performance-the speed is only 573.56Mﬂops when threshold of block size is 24, when loop unrolling and replacing division by multiplication techniques are employed. So we unroll the core instruction Mmls (multiply-subtract instruction for complex numbers) to respectively compute the real part and the imaginary part, and then instruction scheduling and data pre-fetching is applied. Finally, the rate has reached 1211.57Mﬂops, which outperforms ATLAS by 200%.

44

Y. Xu, M. Shao, and D. Teng

As for single ﬂoat type function sTRSM, the techniques used are the same as that of dTRSM except the threshold of block size is set as 72. The optimizing methods for cTRSM are similar as that of zTRSM except the threshold of block size is set as 36. Fig. 3 presents the performance of optimized sTRSM, dTRSM and zTRSM. In Fig. 4, s atlas, z atlas, c atlas individually represent performance of sTRSM, zTRSM, cTRSM optimized by ATLAS, and s opt, z opt, c opt stands for performance of sTRSM, zTRSM, cTRSM optimized by us. It is obvious in Fig. 3 that the curves titled s opt, z opt, c opt are relatively smooth, and the ﬁnal rate of sTRSM, zTRSM, cTRSM are respectively 1800, 1400, 1800Mﬂops, which outperforms that of ATLAS at least 70%.

6

Conclusion and Further Research

Every speciﬁc high performance computer has the necessity to be equipped with a speciﬁcally optimized BLAS, in order to have the hardware resources fully utilized. In this paper, we have optimized triangular matrix functions from an algorithmic level, based on the optimization of GEMM function and the Loongson 2F architecture. Here we have also employed optimizing techniques such as loop unrolling, data pre-fetching and instruction scheduling to elevate the performance of functions. The rate of double ﬂoat functions dTRSM and zTRSM has reached 1300Mﬂops, and the rate of single ﬂoat functions sTRSM and cTRSM has reached 1800Mﬂops. In comparison with ATLAS, our optimization has elevated by 50-60%, even 100-200% when dealing with small-scaled data. At present, our research is concentrated in optimization of BLAS in single core environment. In the future, we will start a new program researching the optimization of BLAS in parallelism when the multi-core CPU, Loongson 3, is put into use.

Acknowledgment We thank Bo Chen, Haitao Jiang, who provided many helpful suggestions. This paper is supported by the Key Subproject of the National High Technology Research and Development Program of China, under the grant No. 2008AA010902 and No. 2009AA01A134.

References 1. Lawson, C., Hanson, R., Kincaid, D., Krogh, F.: Basic Linear Algebra Subprograms for FORTRAN usage. ACM Transaction on Mathematical Software 5(3), 308–323 (1979) 2. Dumas, J.G., Gautier, T., Pernet, C.: Finite Field Linear Algebra Subroutines. In: Proceedings of the 2002 International Symposium on Symbolic on Symbolic and Algebraic Computation (2002)

Optimization of Triangular Matrix Functions in BLAS Library

45

3. Elmroth, E., Gustavson, F., Jonsson, I., Kagstrom, B.: Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software. In: SIAM Review (2004) 4. Chowdhury, R.A., Ramachandran, V.: The Cache-oblivious Gaussian Elimination Paradigm: Theoretical Framework, Parallelization and Experimental Evaluation. In: Proceedings of the Nineteenth Annual ACM Symposium on Algorithms and Computation Theory, pp. 71–80 (2007) 5. Low, T.M., Robert, A., et al.: API for Manipulating Matrices Stored by Blocks. Department of Computer Sciences, the University of Texas at Austin (2004), http://www.cs.utexas.edu/users/flame/pubs/flash.ps 6. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimization of software and the ATLAS project. Parallel Computing 27, 3–35 (2001) 7. Demmel, J., Dongarra, J., Eijkhout, V., Fuentes, E., et al.: Self adapting linear algebra algorithms and software. Proceedings of the IEEE. Special Issue on Program Generation, Optimization, and Adaptation 93(2) 2005 8. Goto, K., van de Geijn, R.: On reducing tlb misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin (2002) 9. Koenker, R., Pin, N.G.: SparseM: A sparse matrix package for R. J. of Statistical Software 8(6) (2003) 10. Gu, N.J., Li, K., et al.: Optimization for BLAS on Loongson 2F architecture. Journal of University of Science and Technology of China 38(7) (2008)

Exposing Tunable Parameters in Multi-threaded Numerical Code Apan Qasem1 , Jichi Guo2 , Faizur Rahman2 , and Qing Yi2 1

Texas State University [email protected] 2 University of Texas at San Antonio {jguo,srahman,qingyi}@cs.utsa.edu

Abstract. Achieving high performance on today’s architectures requires careful orchestration of many optimization parameters. In particular, the presence of shared-caches on multicore architectures makes it necessary to consider, in concert, issues related to both parallelism and data locality. This paper presents a systematic and extensive exploration of the combined search space of transformation parameters that aﬀect both parallelism and data locality in multi-threaded numerical applications. We characterize the nature of the complex interaction between blocking, problem decomposition and selection of loops for parallelism. We identify key parameters for tuning and provide an automatic mechanism for exposing these parameters to a search tool. A series of experiments on two scientiﬁc benchmarks illustrates the non-orthogonality of the transformation search space and reiterates the need for integrated transformation heuristics for achieving high-performance on current multicore architectures. Keywords: Autotuning, memory hierarchy, optimization, parallelism.

1

Introduction

The emergence of chip multiprocessor systems has greatly increased the performance potential of modern computer systems. However, much of the responsibility in exploiting the on-chip parallelism lies with system software like the compiler and the operating system. The complexity of modern architectures implies that compilers, in addition to analyzing the code for proﬁtability, need to account for a large number of architectural parameters to achieve a high fraction of peak performance. Since information about many of these hardware parameters is not readily available, achieving high performance on modern architectures is an extremely challenging task for a compiler. The problem of achieving portable high-performance is often more pronounced for numerical code in scientiﬁc domains. Scientiﬁc applications are characterized by high degrees of temporal reuse and large working sets that often do not ﬁt in

This research is funded by the National Science Foundation under Grant No. 0833203 and No. 0747357 and by the Department of Energy under Grant No. DE-SC001770.

C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 46–60, 2010. c IFIP International Federation for Information Processing 2010

Exposing Tunable Parameters in Multi-threaded Numerical Code

47

the higher-level caches. These codes also involve signiﬁcant amount of ﬂoatingpoint computation and contain loop nests that are amenable to parallelization at one or more levels. Thus, to achieve high-performance for numerical code, the compiler needs to ﬁnd ways to extract parallelism and apply optimizations that exploit data reuse. This task of exploiting locality and parallelism is further complicated by the presence of shared-caches on current multicore platforms [1]. A shared-cache poses an inherent trade-oﬀ between data locality and parallelism. On one hand, any parallel decomposition of the application inevitably inﬂuences data access patterns in each concurrent thread. On the other hand, transformations for improving locality often impose constraints on how much parallelism can be extracted. For example, if a data-parallel decomposition of an application creates a working set of size W Si for thread ti , then to improve data locality, the compiler needs k to ensure that Σi=0 W Si < CS, where CS is the size of the shared cache. To satisfy this constraint, the compiler may consider several options including tiling each W Si , ﬁnding a suitable schedule for t0 − tk or reconﬁguring the decomposition itself. Each approach will not only impact the data reuse of an individual thread but also the synchronization cost and task granularity. Thus, when parallelizing an application for multicore-core architectures, it is important to ﬁnd the right balance between data locality and parallelism, which involves considering a large number of code transformations and parameters. This combinatorial explosion of performance inﬂuencing parameters has made static compiler heuristics largely ineﬀective in delivering portable performance on current architectures. In response to this daunting challenge, several research groups have proposed methods and techniques for automatic performance tuning [2–5]. In an autotuning framework, a code is analyzed, alternate code variants are generated with tuning parameters and then a heuristic search is performed based on execution time feedback to obtain an implementation that yields optimal or near-optimal performance for the target platform. Many of these tuning eﬀorts have achieved reasonable success and have reduced the need for manual tuning of code in certain domains [5, 6]. However, one area where autoutning research has lagged is in considering a search space that covers both parallelism and data locality parameters for current multicore architectures. The few autotuning eﬀorts that have considered tuning for parallelism have limited themselves to single dimensional problem decomposition and have not considered the issue of data locality in concert [7]. There is one notable work that considers both parallelism granularity and data locality in the search space of stencil computations [3]. However, the exploration of the search space is done in an orthogonal manner and does not consider the interaction between search dimensions. This paper takes a systematic approach to characterize and explore the search space of transformation parameters that aﬀect both data locality and parallelism in multi-threaded applications. We observe that the interaction between locality and parallelism can be captured by considering both the shape and size of problem decomposition. The shape of a problem decomposition can be expressed as a

48

A. Qasem et al.

combination of multiple blocking factors used in the code. This is the key insight that has driven this research. In this work, we identify code transformations and transformation parameters that can be used to control both the granularity of parallelism and the memory access patterns of concurrent threads. We establish a set of criteria to characterize the relationship of transformation parameters that aﬀect both data locality and parallelism. Since multicore systems with sharedcaches give rise to both intra and inter-core locality, we consider both types of reuse in constructing the initial search space. Additionally, we also incorporate the issue of false-sharing withing co-running threads. In terms of parallelism, we consider problem decomposition and thread creation at all levels of a given loop nest. We combine all of these parameters into one uniﬁed multi-dimensional search space. We use a transformation scripting language [8] to implement each optimization and expose the parameters to a search engine. To explore the search space in a non-orthogonal manner, we employ several multi-dimensional search methods, including direct search and simulated annealing. Analysis of our experimental results suggest that the shape of a problem decomposition does indeed have signiﬁcant impact on performance of numerical kernels on multicore architectures. The main contributions of this paper include: • identiﬁcation of key transformation parameters for optimizing parallel numerical code on multicore architectures • an automatic method of exposing these parameters for tuning • a non-orthogonal exploration of a search space that includes parameters for exploiting both parallelism and data locality.

2 2.1

Related Work Exploiting Parallelism and Data-Locality on CMPs

The dominance of multicore technology within the processor industry has lead to a plethora of work in code improvement techniques for this platform. Softwarebased approaches have been proposed to create new parallel abstractions, extract more parallelism, exploit data locality in the shared memory hierarchy, improve thread schedules, and control synchronization delays. In our treatment of related research, we limit the discussion to work most relevant to our approach, namely strategies for exploiting parallelism and data locality. Many techniques for extracting parallelism and controlling granularity is described in the literature [9]. Recent work has focused on extracting ﬁne-grained parallelism and exploring diﬀerent models of parallelism such as pipelined parallelization. Thies et al. [10] describe a method for exploiting coarse-grain pipelined parallelism in C programs. They also develop a set of parallel programming primitives to support pipeline parallelism. They utilize the asynchronicity of short threads to hide the latency of memory accesses to improve performance. Papadopoulos et al. show that adding more execution threads is not beneﬁcial and can be prohibitively diﬃcult to implement for database applications [11].

Exposing Tunable Parameters in Multi-threaded Numerical Code

49

Locality transformations described in this paper have been widely studied [9]. Loop blocking or tiling is the predominant transformation for exploiting temporal locality for numerical kernels [12, 13]. The use of unroll-and-jam to improve register reuse is also common in both commercial and research compilers. Loop fusion and array contraction have been used in conjunction to improve cache behavior and reduce storage requirements [14]. Loop alignment has been used as an enabling transformation with loop fusion and scalarization [9]. Loop Skewing and time skewing serve as enabling transformations in strip-mining and parallelizing loop with carried dependencies. Although the literature is replete with heuristics for selecting tile sizes and choosing unroll factors [12, 13], attempts at integrating all these transformations have been less common [15]. These approaches target single-core machines, and thus does not deal with the problem of exploiting parallelism. Relatively few papers have addressed the issue of data locality and parallelism in concert. Among these, Vadlamani and Jenks [16] present the synchronized pipelined parallelism model for producer-consumer applications. Although their model attempts to exploit locality between producer and consumer threads, they do not provide a heuristic for choosing an appropriate synchronization interval (i.e. tile size). Krishnamoorthy et al. [17] describe a strategy for automatic parallelization of stencil computations. Their work addresses both parallelism and data locality issues and is similar to the work presented in this paper. However, Krishnamoorthy et al. uses static heuristics for selecting tile sizes and does not employ empirical search, as we do in this work. 2.2

Autotuning Multi-threaded Applications

Since the autotuning eﬀort started prior to the multi-core era, much of the earlier work focused on tuning for single-core machines. Among these, several autotuned libraries for speciﬁc scientiﬁc domains have been quite successful. ATLAS which provides highly tuned linear algebra routines is widely used within the scientiﬁc community and has become the de facto standard for evaluating other autotuning systems [5]. Research that aims to autotune general applications fall into two categories: those that tackle the phase-ordering problem and aim to ﬁnd the best sequence of transformations [2] and those that concentrate on ﬁnding the best parameter values for transformations that use numerical parameters [3, 4]. Autotuning techniques for multicore processors have also been applied to stencil computations and other numerical code within the scientiﬁc domain. Hall et al. developed the Chill framework that can tune numerical code on a wide range of memory transformation parameters [18]. Datta et al. propose a framework that can tune stencil code for both parallelism and locality on current multicore systems [3]. The work by Datta et al. comes closest to the work presented in this paper. The key diﬀerence between their approach and ours is that our framework explores a multi-dimensional search space, whereas their framework performs an orthogonal search, looking at one dimension at a time.

50

A. Qasem et al. do k = 1, N do j = 1, N do i = 1, N A(i,j,k)= B(i,j,k) + B(i-1,j,k) + B(i+1,j,k) + B(i,j-1,k) + B(i,j+1,k) enddo enddo enddo

(a) example numerical kernel individual thread k j

i

“unit” sweep parallelization over i, j, k no blocking

“plane” sweep parallelization over k no blocking

“beam” sweep blocking of i and j parallellization over ii and jj

(b) thread execution patterns Fig. 1. Example execution patterns of parallelized and blocked numerical code

3

Characterizing Performance Trade-Oﬀs

In this section, we characterize performance trade-oﬀs in parallelizing and blocking memory intensive numerical code. For this discussion, we will consider a simple three dimensional loop nest, as presented in Fig. 1(a). We assume the loop is fully parallelizable and there is reuse of data along all three dimensions (i.e., the only carried dependence in the loop nest are input dependencies). Although these assumptions are somewhat simplistic, this example captures the core computation pattern for many scientiﬁc code and is a suitable tool for illustrating the complex interaction between blocking and parallelizing transformations. Fig. 1(b) depicts example execution patterns for the code in Fig. 1(a). As we can see, the number of loops that are parallelized and the number of dimensions that are blocked can result in widely varying thread granularity and data access patterns for each thread. For example, parallelizing across i, j and k loops creates extremely ﬁne-grained parallelism, where each thread updates only one value in the array. We achieve a high-degree of concurrency with this decomposition.

Exposing Tunable Parameters in Multi-threaded Numerical Code

51

do t = 1, T do k = 1, N do j = 1, N do i = 1, N A(i,j,k)= B(i,j,k) + B(i-1,j,k) + B(i+1,j,k) + B(i,j-1,k) + B(i,j+1,k) enddo enddo enddo enddo

(a) time-step kernel k j

thread granularity

i

Reduced granularity Improved intra-core locality

reuse over time, multiple sweeps over working set

smaller working set per thread

(b) thread execution patterns Fig. 2. Exploiting intra-core locality by reducing task granularity

However, this variant is unlikely to perform well on most systems because the thread creation time to task completed ratio is very high. Moreover, parallelizing the innermost loop is going to negatively impact spatial locality, which can lead to performance loss (as we discuss later in this section). This issue of extreme ﬁne-granularity can be addressed by parallelizing a subset of the loops in the nest (e.g., “plane” sweep) and by blocking in one or more dimensions and then parallelizing the blocked loops (e.g., “beam” sweep). However, both these methods have potential drawbacks. Parallelizing a subset of loops might imply that available parallelism on the target platform is not fully exploited. On the other hand blocking a loop with an unfavorable block size may lead to poor locality in threads. We now discuss these trade-oﬀs in terms of exploitable intra-core and inter-core locality. Intra-core temporal locality occurs when a data value, touched by a thread running on core p, is reused either by a thread running on the same core (i.e., core p). Exploiting intra-core locality is particularly important for numerical code that sweeps over data domains multiple times (e.g., time-step computations). An example code is shown in Fig. 2(a). To achieve eﬃcient sequential execution, this code would typically be tiled in all three spatial dimensions (with the

52

A. Qasem et al.

aid of loop skewing [19]) to exploit temporal locality across diﬀerent time-steps. If we parallelize this code along the time dimension, the execution resembles the one shown on the left in Fig. 2(b), where the time dimension is broken up into four blocks. Each of the blocks are executed concurrently as a separate thread and each thread sweeps a block of data multiple times. In this scenario, it is important to ensure that the working set of each thread is made small enough to ﬁt in the cache. To enforce this, we can further subdivide the time blocks as shown on the right in Fig. 2(b). By selecting a suﬃciently small block size, we can ensure that the working set of each individual thread ﬁts into the cache. However, as we observe, reducing the block size also causes a reduction in thread granularity. Thus, reducing the block size for improved intra-core locality may result in an unbalanced load for the entire application and also add to thread creation overhead. The optimal blocking factor that exploits intra-core locality and ﬁnds a suitable granularity depends on a host of factors including the number of cores, the cache size and associativity, the current system load and the input data set. Thus, ﬁnding a suitable block size for intra-core locality and parallelism is best achieved through autotuning, as we demonstrate later in this paper.

4

Tuning Framework

Fig. 3 gives an overview of our tuning framework, which includes two key components: a transformation engine based on the POET language [8] for generating alternate code variants, and a parameterized search engine (PSEAT) for searching the transformation search space. We use HPCToolkit to probe HW performance counters of the target platform and collect a variety of performance metrics [20]. Since we do not explicitly deal with feedback issues in autotuning in this research, the rest of this section is devoted to describing our transformation scripting language and the search engine. 4.1

Parameterization of Compiler Optimizations

We have used the POET transformation engine shown in Figure 3 to optimize the thread-level parallelism, memory locality, variable privatization, and register reuse for two SPEC95 benchmarks, mgrid and swim. The core computation of mgrid is a 27-point stencil computation on a three dimensional space. We have focused on optimizing two performance critical subroutines, RESID and PSINV, as they take around 58% and 23% of the mgrid execution time respectively. Each subroutine has three perfectly-nested loops, all of which can be parallelized. In particular, none of the loops carry any dependence, but each loop carries a large number of inter-interation reuses of the input data. We have applied three optimizations: OpenMP parallelization, loop blocking, and scalar replacement, for both subroutines, and have parameterized each optimization in the following fashion.

Exposing Tunable Parameters in Multi-threaded Numerical Code Arch Specs

53

Source

PSEAT

Search Space Interpreter

Hill Climber

Sim. Anneal

GA

next point

POET Interpreter

Build Script Fortran Frontend Specializa tion

Feedback Parser

feedback

Transformat ion Library

C Frontend Specializa tion

Code Variant

execute

Direct Search

POET Script

POET

initial script

Search Space

HPCToolkit

Fig. 3. Example execution patterns of parallelized and blocked numerical code

– Parallelization. We have parameterized which loops within each loop nest to parallelize, whether nested parallelism should be considered, the thread scheduling and chunk size computed by each thread, and the number of threads that will be used to evaluate the loops. – Loop blocking. We have parameterized which loops to block and the blocking factor of each loop. – Scalar Replacement. We have parameterezied whether to apply scalar replacement for each array referenced with the loop nests. The core computation of swim includes two subroutines, CALC1 and CALC2, each of which takes more than 30% of the overall execution time of swim. Each routine includes a sequence of three loop nests that can be selectively fused to improve register performance. Speciﬁcally, the ﬁrst loop nest can be fused with either of the following loops, but not both. All loops can be parallelized. To optimize these loop nests, we have combined the application of OpenMP parallelization, loop blocking, loop fusion, loop unroll-and-jam, and scalar replacement. All transformations are parameterized in the following fashion. – Parallelization. We have parameterized which loops to parallelize, and how many threads to use. – Loop blocking. We have parameterized which loops to block, and the block factor for each loop. – Loop fusion. We have parameterized which loops to fuse together.

54

A. Qasem et al. 100 3 R 1 16 P 4 E 2 8 16

# # # # #

maximum number of program evaluations number of dimensions in the search space range : 1 .. 16 permutation : sequence length 4 enumerated : two possible value 8 and 16

Fig. 4. Example conﬁguration ﬁle for PSEAT

– Scalar replacement. We have parameterized which loops to apply scalar replacement. – Unroll and Jam. We have parameterized which loops to apply the transformation, and what unroll factor to use for each loop. All the parameterized optimizations are implemented as source-to-source program transformations using POET [8, 21] and by invoking a POET optimization library which implements a large collection of compiler transformations such as loop fusion, loop blocking, scalar replacement, etc. First, an annotation is inserted as a comment within the original Fortran source code to identify each important sequence of loop nests to optimize. Then, a separate POET transformation script is written to apply necessary optimizations to the interesting loop nests. A large collection of command-line parameters are used to control how to apply each optimization in diﬀerent ways. POET is a language speciﬁcally designed for parameterizing optimizations of application code for auto-tuning. For more details, see [8, 22]. 4.2

Searching the Space of Alternate Variants Using PSEAT

Searching for alternate code variants is accomplished through the use of PSEAT a parameterized search engine for automatic tuning. In most existing autotuning systems, the search module is tightly coupled with the transformation engine. PSEAT is designed to work as an independent search engine and provides a search API that can be used by other autotuning frameworks. This section discusses some of the design features of PSEAT and its integration into the the tuning system. Input to PSEAT is a conﬁguration ﬁle that describes the search space of optimization parameters. Fig. 4 shows an example conﬁguration ﬁle. The syntax for describing a search space is fairly simple. Each line in the conﬁguration ﬁle describes one search dimension. A dimension can be one of three types: range (R), permutation (P) or enumerated (E). range is used to specify numeric transformation parameters such as tile sizes and unroll factors. permutation speciﬁes a transformation sequence and is useful when searching for the best phase-ordering. An enumerated type is a special case of the range type. It can be used to describe a dimension where only a subset of points are feasible within a given range. An example of an enumerated type is the prefetch distance in software prefetching. In addition, PSEAT supports inter-dimensional constraints for all three dimension types. For example, if the unroll factor of an inner loop needs to be smaller than the tile size of an outer loop then this constraint is speciﬁed using a simple inequality within the conﬁguration ﬁle.

(a) Execution time sensitivity

Exposing Tunable Parameters in Multi-threaded Numerical Code

55

(b) L1 miss rate sensitivity

(c) L2 miss rate sensitivity Fig. 5. mgrid performance sensitivity to block size

PSEAT implements a number of search strategies including genetic algorithm, direct search, window search, taboo search, simulated annealing and random search. We include random in our framework as a benchmark search strategy. A search algorithm is considered eﬀective only if it does better than random on a given search space.

5 5.1

Experimental Results Experimental Setup

Platforms: We present experimental results on three Intel-based multicore systems: a dual-core (Core2Duo, Conroe), a quad-core (Core2Quad, Kentsfield) and an eight-core machine with two quad-core processors (Xeon, Nehalem). GCC 4.3.2 with the -O2 ﬂag was used to compile variants on each platform. Benchmarks: For this study, we look at the performance characteristics of mgrid and swim, two scientiﬁc benchmarks from the SPEC95 benchmark suite. Both mgrid and swim contain several loops nests that can be fully parallelized and exhibit a high-degrees of temporal locality. Experiments for both swim and mgrid used the reference data set.

56

5.2

A. Qasem et al.

Performance Impact of Blocking

We ﬁrst examine the blocking search space of mgrid and swim. A blocking search space is divided into two levels : loop selection and block sizes. The loop selection level refers to which loops are selected for blocking. For example, in a twodimensional nest, we may choose to block just the inner loop, just the outer loop, both the inner and outer loop or none of the loops. These choices can be represented with a bit string of size two and represents a search space of size 22 . For each loop selection level, there is a multi-dimensional search space that consists of all the valid block sizes for each loop that is blocked. For mgrid, we explored four diﬀerent blocking options the loop nests appearing in resid and psinv routines. For each blocking selection we explored a range of block sizes, starting from four and going up to 64 in steps of four. This gave rise to a two-level 15 dimensional search space with about 1516 points. The loop nests in swim are two-dimensional. Thus, the blocking search space of swim has fewer dimensions. However, the range within each dimension was larger, starting at 16 and going up 512. Fig 5 shows how selection of blocking loops and choice of blocking factors impact performance of mgrid. The numbers presented are from experimental runs on Kentsfield. As we can see, there is signiﬁcant variation in performance and L1 and L2 miss rates as we vary the block sizes. This is not surprising, since scientiﬁc code like mgrid are known to be sensitive to changes in the blocking factor. We notice that miss rates for L1 are high for smaller block sizes, and they gradually go down as we increase the block size. For L2, we observe a slightly diﬀerent behavior. We notice that when multiple loops are blocked, the L2 miss rates are very low for smaller block sizes. This implies that multi-level blocking is able to exploit locality at multiple levels. However, for this to happen block sizes need to be very small (< 8). Interestingly, we observe that a reduction in the L1 or L2 miss rate does not necessarily correspond to performance gains. This means that although some blocking factors may be good for improving locality, they may have other adverse aﬀects in terms of high loop instruction overhead and possibly reduced ILP. These results reiterate the need for tuning to ﬁnd the optimal block size. 5.3

Performance Impact of Parallelization Granularity

Similar to blocking, selecting loops for parallelization can have a signiﬁcant impact on performance. We explored several diﬀerent parallelization options in terms of number of loops selected for parallelization and the number of threads used for the parallel variant. Although thread scheduling and chunk size can also have a huge impact on performance, we did not explore these dimensions for this study. Among the choices for parallelizing a loop nest, we observed that nested parallelism always resulted in huge performance loss due to thread synchronization overhead. This may be due to deﬁciencies in the GCC OpenMP library. Nevertheless, we discard these options in our study.

57

Exposing Tunable Parameters in Multi-threaded Numerical Code

(a) mgrid

(b) swim

Fig. 6. Performance sensitivity to parallelization parameters

Thus, parallelization search spaces for mgrid and swim have signiﬁcantly fewer dimensions then the blocking search spaces. For mgrid, we considered parallelizing loops at each level of the three dimensional loop nests in resid and psinv subroutines. For each variant, we also explored implementations with two, four and eight threads. For swim, we considered parallelizing both the inner and outer loops of initial, calc1, calc2 and calc3 routines. Fig. 6(a) shows performance of six diﬀerent parallel variants of mgrid on Conroe, Kentsfield and Nehalem. In the ﬁgure, P1TK means that the outermost loops in both resid and psinv was paralleled, and the parallel variant used K threads. P2TK means the middle loop was parallelized. We observe that generally, parallelizing the outermost loop is most proﬁtable. However, this is not universally true. For example, on Conroe, the variant where the middle loop is parallelized performs the best. In terms of number of threads, as expected, setting the thread number to the available cores appear to work best. But again, there are exceptions. For example, on Kentsfield the best performance is obtained for P1T2 where the number of threads is 2. For swim, the performance sensitivity to diﬀerent parallelization strategies is depicted in Fig. 6(b). For swim, we do not show numbers for parallelizing the inner dimension, as they always ran orders of magnitude slower than the sequential version. In Fig. 6(b), P1TK, refers to a parallel variant where all the outer loop of all four subroutines are parallelized, with K threads. The P2TK variant refers to the case where we only parallelize calc1 and calc2. Overall, the selection of parallel loops tend to have less of an impact on swim than mgrid. However, even in this case, there is no single conﬁguration that works well for all three platforms. On Conroe, P2 performs better than P1 when using four threads, whereas for the other two platforms P1 appears to be the best choice. 5.4

Non-orthogonality of Search Space

One of the main goals of this experimental study was to validate our claim that blocking and parallelization search dimensions are indeed non-orthogonal. To accomplish this, we set up an experiment where we explored the combined search space of blocking and parallelism using both orthogonal and non-orthogonal

A. Qasem et al.

58

(a) mgrid search space

(b) swim search space Fig. 7. Orthogonal vs. Multi-dimensional search

search methods. We chose three orthogonal search methods : one where the blocking dimensions are explored ﬁrst (blockfirst), one where the parallelism dimensions are explored ﬁrst (parfirst), and one where blocking and parallelism dimensions are explored independently and best value obtained from each search is used (separate). For multidimensional search, we selected simulated annealing (anneal) and direct search (direct), both of which are known to be eﬀective in exploring the search space of transformation parameters [4]. To keep the comparison fair, individual dimension in the orthogonal search were searched using simulated annealing. We allowed each search algorithm to run for 120 iterations. We instrumented the search algorithms to output the current best value every ten iterations. Fig. 7(a) shows results of exploring the combined search space of mgrid on the quad-core platform. We observe the both direct and anneal have a clear advantage over the orthogonal search methods when the number of evaluation goes beyond 50. For fewer number of iterations parfirst is able to compete with the multidimensional strategies but in the long run does not yield the desired performance. separate performs the worst, not ﬁnding a better value over the baseline until about the 60th iteration. In fact, most of the variants picked by separate lead to worse performance. Since we only display the best value so far, the speedup appears as 1. The poor performance of separate and blockfirst, and less than average performance of parfirst indicate that there is indeed interaction between the blocking and parallelization dimensions

Exposing Tunable Parameters in Multi-threaded Numerical Code

59

that are not captured by orthogonal search method. For the multi-dimensional search methods, there is no clear winner between direct and anneal. However, the overall speedup obtained by the two search methods is not that high. We speculate this could be attributed to the absence of scheduling and chunk size as a search space parameter. Fig. 7(b) presents performance comparison of diﬀerent search algorithms on the swim search space. Again, we notice that the multi-dimensional search strategies outperform the orthogonal techniques. However, in this case the performance gap is not as much. We also observe that the overall performance achieved by any search algorithm is less for swim than mgrid. This can be attributed to the ﬁner thread granularity in swim. Because swim contains only two-dimensional loops, the amount of work per thread is not suﬃcient to oﬀset the synchronization overheads. Moreover, blocking proved to be less eﬀective for the two-dimensional case.

6

Conclusions

In this study, we explored the search space of parallelism and data locality transformations for multi-threaded applications. We presented a method for identifying and exposing tunable parameters to a search tool. Our experimental results illustrate the non-orthogonality of the search spaces and reinforces the need for application tuning through integrated transformation heuristics.

References 1. Zhang, E.Z., Jiang, Y., Shen, X.: Does cache sharing on modern cmp matter to the performance of contemporary multithreaded programs? In: PPoPP 2010: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2010) 2. Almagor, L., Cooper, K., Grosul, A., Harvey, T., Reeves, S., Subramanian, D., Torczon, L., Waterman, T.: Finding eﬀective compilation sequences. In: Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (2004) 3. Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J., Yelick, K.: Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC 2008 (2008) 4. Qasem, A., Kennedy, K.: Proﬁtable loop fusion and tiling using model-driven empirical search. In: Proceedings of the 20th ACM International Conference on Supercomputing (June 2006) 5. Whaley, C., Dongarra, J.: Automatically tuned linear algebra software. In: Proceedings of SC 1998: High Performance Networking and Computing (November 1998) 6. Frigo, M.: A fast Fourier transform compiler. In: Proceedings of the SIGPLAN 1998 Conference on Programming Language Design and Implementation, Montreal, Canada (June 1998)

60

A. Qasem et al.

7. Song, F., Moore, S., Dongarra, J.: Feedback-directed thread scheduling with memory considerations. In: HPDC 2007: Proceedings of the 16th International Symposium on High Performance Distributed Computing (2007) 8. Yi, Q.: The POET language manual (2008), http://www.cs.utsa.edu/~ qingyi/POET/poet-manual.pdf 9. Allen, R., Kennedy, K.: Optimizing Compilers for Modern Architectures. Morgan Kaufmann, San Francisco (2002) 10. Thies, W., Chandrasekhar, V., Amarasinghe, S.: A practical approach to exploiting coarse-grained pipeline parallelism in c programs. In: International Symposium on Microarchitecture (2007) 11. Papadopoulos, K., Stavrou, K., Trancoso, P.: Helpercore db: Exploiting multicore technology for databases. In: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques (2007) 12. Wolf, M.E., Lam, M.: A data locality optimizing algorithm. In: Proceedings of the SIGPLAN 1991 Conference on Programming Language Design and Implementation, Toronto, Canada (June 1991) 13. Coleman, S., McKinley, K.S.: Tile size selection using cache organization. In: Proceedings of the SIGPLAN 1995 Conference on Programming Language Design and Implementation, La Jolla, CA (June 1995) 14. Ding, C., Kennedy, K.: Improving eﬀective bandwidth through compiler enhancement of global cache reuse. In: International Parallel and Distributed Processing Symposium, San Francisco, CA (April 2001) (best Paper Award) 15. Wolf, M., Maydan, D., Chen, D.: Combining loop transformations considering caches and scheduling. In: Proceedings of the 29th Annual International Symposium on MicroArchitecture (1996) 16. Vadlamani, S.N., Jenks, S.F.: The synchronized pipelined parallelism model. In: The 16th IASTED International Conference on Parallel and Distributed Computing and Systems (2004) 17. Krishnamoorthy, S., Baskaran, M., Bondhugula, U., Ramanujam, J., Rountev, A., Sadayappan, P.: Eﬀective automatic parallelization of stencil computations. In: PLDI 2007: Proceedings of the 2007 ACM SIGPLAN conference on Programming Language Design and Implementation (2007) 18. Hall, M., Chame, J., Chen, C., Shin, J., Rudy, G., Khan, M.M.: Loop transformation recipes for code generation and auto-tuning. In: The 22nd International Workshop on Languages and Compilers for Parallel Computing, LCPC 2009 (2009) 19. Wonnacott, D.: Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In: Proceedings of the 14th International Symposium on Parallel and Distributed Processing (IPDPS 2000), Washington, DC, USA, IEEE Computer Society, Los Alamitos (2000) 20. Adhianto, L., Banerjee, S., Fagan, M., Krentel, M., Marin, G., Mellor-Crummey, J., Tallent, N.R.: Hpctoolkit: tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience (2009) 21. Yi, Q., Seymour, K., You, H., Vuduc, R., Quinlan, D.: Poet: Parameterized optimizations for empirical tuning. In: Workshop on Performance Optimization for High-Level Languages and Libraries (March 2007) 22. Yi, Q., Whaley, C.: Automated transformation for performance-critical kernels. In: ACM SIGPLAN Symposium on Library-Centric Software Design, Montreal, Canada (October 2007)

LU Decomposition on Cell Broadband Engine: An Empirical Study to Exploit Heterogeneous Chip Multiprocessors Feng Mao and Xipeng Shen Computer Science Department The College of William and Mary Williamsburg, VA, USA 23185

Abstract. To meet the needs of high performance computing, the Cell Broadband Engine owns many features that diﬀer from traditional processors, such as the large number of synergistic processor elements, large register ﬁles, the ability to hide main-storage latency with concurrent computation and DMA transfers. The exploitation of those features requires the programmer to carefully tailor programs and simutaneously deal with various performance factors, including locality, load balance, communication overhead, and multi-level parallelism. These factors, unfortunately, are dependent on each other; an optimization that enhances one factor may degrade another. This paper presents our experience on optimizing LU decomposition, one of the commonly used algebra kernels in scientiﬁc computing, on Cell Broadband Engine. The optimizations exploit task-level, data-level, and communication-level parallelism. We study the eﬀects of diﬀerent task distribution strategies, prefetch, and software cache, and explore the tradeoﬀ among diﬀerent performance factors, stressing the interactions between diﬀerent optimizations. This work oﬀers some insights in the optimizations on heterogenous multi-core processors, including the selection of programming models, considerations in task distribution, and the holistic perspective required in optimizations. Keywords: Software cache, Heterogeneous architecture, LU decomposistion, CELL Broadband Engine.

1

Introduction

Multi-core and heterogeneousness have been the recent trends in computer development. A typical example is the IBM Cell Broadband Engine (Cell B/E) [11], an asymmetric and heterogeneous multi-core architecture. It typically consists of one general-purpose IBM PowerPC processor element (PPE) and eight independent synergistic processor elements (SPEs). The SPEs have large register ﬁles and good ability to hide main-memory latency with concurrent computation and direct memory access (DMA) transfers. These features make this heterogeneous

This work is done when Feng Mao was associated with College of William and Mary.

C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 61–75, 2010. c IFIP International Federation for Information Processing 2010

62

F. Mao and X. Shen

architecture suitable for accelerating computation-intensive applications, such as gaming, multimedia, and scientiﬁc applications. The matching between software and hardware on such an architecture is more important and also, more challenging than on traditional homogeneous systems, mainly because the architecture is more complex and more ﬂexible in control. For instance, the SPEs have no cache but local storages, whose management has to be explicit through schemes like DMA transfers. Therefore, a suboptimal matching may easily cause factors of performance degradation than the optimal matching. On the other hand, ﬁnding the good matching requires the consideration of multiple factors at the same time, in particular, how an optimization aﬀects data locality, load balance, communication cost, and multi-level parallelism. It is typical that those factors are inter-dependent, sometimes even causing optimization conﬂicts. A good understanding to those performance factors and their interactions is important for eﬀective uses of such processors. This work concentrates on the exploration of the diﬀerent performance factors and their interactions on LU decomposition. We choose LU decomposition as the focus because it is a fundamental kernel in many scientiﬁc applications, such as linear algebra and signal processing programs. The insights obtained from this work may directly beneﬁt those applications. In addition, the computation of LU decomposition includes many data dependences, posing interesting challenges to the exploitation of parallelism, communication, and other optimizations. Although many studies have analyzed LU decomposition on traditional homogeneous systems, we are not aware of any systematic exploration to the optimizations of the problem on the Cell B/E architecture. (Existing implementations, such as [9], show no systematic explorations to the optimization space.) This work emphasizes the interactions between the diﬀerent performance factors in the optimization, distinguishing it from many other case studies on Cell programming. More speciﬁcally, this work makes the following contributions: – Based on Cell SDK, we develop an extensible framework for ﬂexibly experimenting diﬀerent optimization components for LU decomposition on Cell B/E. The framework allows plugins of a set of optimization components, and reports various performance metrics, including numbers of DMA operations, software cache hit rates, branch hint hit rates, numbers of clock cycles and so forth. – We exploit diﬀerent levels of parallelism supported by Cell. Our implementation exposes parallel tasks through a status matrix, leverages data-level parallelism by manual vectorization, and enables parallel communications by the use of non-blocking mailboxes and DMA. – We explore the eﬀects of a spectrum of locality optimization techniques and four kinds of task distribution schemes for block LU decomposition. We concentrate on prefetching and software cache management to enhance the eﬀective bandwidth and hide memory access latency. We adopt SPE-centric computation acceleration programming model, and construct three static

LU Decomposition on Cell Broadband Engine

63

task distribution models and a dynamic distribution model to explore the tradeoﬀ between locality and load balance. – We conduct detailed analysis on the inﬂuence of the diﬀerent techniques on matching LU decomposition with Cell B/E architecture. The analysis reports the inﬂuence of each optimization on individual performance factors, such as locality, load balance, communication overhead, and parallelism. More importantly, it reveals the interactions of those factors and produces insights into the holistic consideration of optimizations for heterogeneous multicore processors. For instance, the experiments show that although task distribution aﬀects both locality and load balance, load balance should be the only consideration when a good prefetching scheme is included in block LU decomposition. The rest of the paper is organized as follows. In section 2, we introduce the background on Cell B/E architecture and block LU decomposition algorithm. In section 3, we describe our implementation and optimizations of the algorithm on Cell B/E. Section 4 reports and analyzes the inﬂuence of the optimizations and their interactions. Section 5 discusses related work, followed by a short summary.

2 2.1

Background Cell B/E Architecture

The Cell B/E is a heterogeneous architecture, designed for accelerating computationally intensive applications [7]. A Cell processor is a single-chip multi-core processor, including 1 PPE and 8 SPEs operating on a shared, coherent memory. Figure 1 shows an overview of the architecture. The PPE is the main processor. It contains a 64-bit PowerPC Architecture core with multimedia extension unit to support vector operations. Typically it runs the operating system, manages system resources, and controls the allocation and management of SPE threads. The 8 SPEs are processors designed for single instruction multiple data (SIMD) computation. Each contains a 256-KB local store controllable by software and a large (128-bit, 128-entry) register ﬁle. It relies on asynchronous DMA for data and instruction transfer to and from the main memory. It supports a special SIMD instruction set and is optimized for data-rich operations. PPE and SPEs are connected by the element interconnect bus. Each SPE has a memory ﬂow controller (MFC) to communicate with main memory. It is the application’s responsibilty to maintain coherence between main memory and distributed local stores. User programs explicitly issue DMA command to exchange data between local store and memory. The user program code running on SPEs typically are implemented as a group of threads. 2.2

Block LU Decompostion Algorithm

LU decomposition is to transform a matrix A into a product of a lower triangular matrix L and an upper triangular matrix U , expressed as A = L ∗ U . (Pivoting

64

F. Mao and X. Shen

PPE

SPE

SPE

SPE

LS

LS

LS

SPE LS

SPE

SPE

SPE

SPE

LS

LS

LS

LS

Bus / DMA

Main Memory

Fig. 1. Cell Broadband Engine archicture

is not considered in this work.) Due to the importance of LU decomposition in numerical computing, many studies have explored the problem on the aspects of both algorithms and implementations. Block LU decompostion is a typical parallel algorithm to solve this problem in a divide and conquer stragtegy. Assume that we have a matrix A, expressed as a composition of 4 sub-matrices: L00 0 U00 U01 A00 A01 = ∗ (1) A= A10 A11 0 U11 L10 L11 The LU decomposition results can be derived as follows: ⎧ ⎧ L00 U00 = A00 L00 U00 = A00 ⎪ ⎪ ⎪ ⎪ ⎨ ⎨ L10 U00 = A10 L10 = A10 /U00 −→ L00 U01 = A01 U01 = L00 \A01 ⎪ ⎪ ⎪ ⎪ ⎩ ⎩ L10 U01 + L11 U11 = A11 L11 U11 = A11 − L10 U01

(2)

L00 and U00 are respectively a lower and upper triangle matrix. Because A00 is usually small, L00 and U00 can be easily obtained through Gaussian elimination. The sub-matrices, L10 and U01 , can be computed subsequently. Notice that the ﬁnal equation in 2 is another LU decomposition problem, with a smaller problem size. It can be reduced to an even smaller LU decomposition problem in the same manner as above. When the problem is reduced to a matrix with only 1 block, Gaussian elimination will produce the ﬁnal result. This iterative strategy is the core of the block LU decomposition algorithm. We refer the reader to [4] for more details.

3

Implementation and Optimizations

The Cell B/E supports parallelism at diﬀerent levels, such as task-level, data level, and communication level. In the ﬁrst part of this section, we concentrate on the exploitation of various parallelisms in block LU decomposition, with the focus on task-level parallelism. We discuss the task-level dependences in block LU decomposition, and describe the use of a status matrix to help dependence analysis and expose parallel tasks. In the second part, we describe the programming model used in our implementation. Our focus in this part is on the diﬀerent

LU Decomposition on Cell Broadband Engine

65

strategies for distributing tasks to SPEs. The distribution strategies are important to load balance, locality, and communication cost. In the third part, we concentrate on the use of prefetch and software cache for locality improvement. 3.1

Dependence Analysis and Parallelism Exploitation

In a typical implementation of block LU decomposition, a matrix is partitioned into many small blocks. At the ﬁrst iteration of the decomposition process, the whole matrix is treated as 4 regions as represented by diﬀerent depths of grey (or colors) in the left bottom graph in Figure 2. The ﬁrst iteration computes the ﬁnal LU decomposition results corresponding to regions A00 , A01 , and A10 using the top 3 equations in Equation 2. It also computes the right hand side of the ﬁnal equation in Equation 2 as the intermediate results corresponding to the region A11 . The next iteration conducts the same computation but only on the updated region A11 . As more iterations are executed, the working set becomes smaller and smaller until the ﬁnal result of the last block is attained. The bottom row of graphs in Figure 2 illustrate the whole iterative process.

0

1

2

Iteration 3 ….

A00

A10

n-2

n-1

n

A01 A11

Fig. 2. Iterative computation in block LU algorithm. The top graph shows the dependences in the computation, represented by arrows. In that graph, the matrix is represented by a single column. The bottom graph shows the part of the matrix that is manipulated in each iteration.

The top row of graphs in Figure 2 shows the dependences in the computation. Consider the computation in iteration 0. The ﬁnal results corresponding to region A00 depend on the current values of region A00 . And the computed results, along with the current values of a block in region A01 , determine the computation corresponding to block A01 ; the blocks in the region A10 have the similar dependences. The computation corresponding to a block (i, j) in region A11 depends on the current values of block (i, j) and the results corresponding to blocks (0, j) and (i, 0).

66

F. Mao and X. Shen

We use a status matrix to help runtime dependence checking. Each block in the data matrix has one corresponding element in the status matrix, indicating the number of the iteration in which the computation corresponding to the block has just ﬁnished. During runtime, the PPE will use the status matrix to determine the tasks that are ready to run. For example, in the second iteration (i.e. iteration 1), if the PPE ﬁnds that the status corresponding to block (1, 3) and (2, 1) are both 1 (the row and column numbers are 0-based), it will immediately know that the computation corresponding to block (2, 3) is ready to run. This scheme exposes all task-level parallelism. Besides task-level parallelism, we also exploit other levels of parallelism supported by Cell B/E. Cell B/E supports data-level parallelism mainly through vectorization. In our implementation, we manually vectorized the program to maximize the data-level parallelism. We unroll the loops and insert branch hints to increase instruction-level parallelism (details in Section 3.4.) We exploit communication parallelism by using mailbox as the main communication scheme between SPEs and the PPE. Communication through mailbox is non-blocking. When a task is assigned to a SPE, it is given a unique DMA tag, and then all the DM A Get requests that the task needs are enqueued into the MFC. While the MFC is executing the DMA operations, the SPE is free to do computation on other requests. The status of the pending DMA operations under each tag is polled regularly. When the MFC indicates that some DMA tag has no more pending DMA operations, the corresponding task is marked ”processable”; it will be processed when the processor becomes free. The similar non-blocking scheme is used for the store of computation results to the main memory. 3.2

Programming Model and Task Distribution

As a heterogeneous architecture, Cell permits two kinds of programming models: the PPE-centric, and the SPE-centric. In the PPE-centric model, the PPE runs the main application, and oﬀ-loads individual tasks to the SPEs. The PPE waits for, and coordinates, the results returned by the SPEs. This model has some variants like multistage pipeline model, parallel stage model, and services model [8]. In the SPE-centric model, most of the application code is distributed among the SPEs. The PPE is a centralized resource manager for the SPEs. Each SPE fetches its next work item from main storage when it completes its current work. In this work, we choose SPE-centric model because it ﬁts the property of the problem and may expose the maximum amount of task-level parallelism. Figure 3 depicts our programming model. The instructions for the computation of a matrix block reside in the local store of every SPE. The SPEs conduct all the decomposition computation, and the PPE’s job is to maintain the status matrix, ﬁnd ready tasks and distribute them to the SPEs. There are two queues associated with each SPE: the ready queue and the done queue. The PPE puts the tasks that are ready to run into the ready queues for SPE to dequeue and run. The SPEs put completed tasks and the results

LU Decomposition on Cell Broadband Engine

Static task lists

67

SPE 0 available task queue Task distributor(s)

Mail box

Block operation

SPE 0

SPE 0 finished task queue In DMA queue

Status matrix SPE 7 available task queue

Check/update status matrix

Mail box SPE 7 finished task queue

PPE

Block operation

SPE7

In DMA queue

Fig. 3. Programming model for block LU decomposition on Cell B/E. The static task lists are used only in static task distributions.

into the done queues for PPE to dequeue and commit the changes to the status matrix and the output matrix. The communication between PPE and SPEs is through mailboxes, a scheme provided by Cell for exchanging 32-bit messages. Both parties use non-blocking operations (mailbox stating) to avoid unnecessary data stalls. Task Distribution. The scheme of task distribution determines which SPE will get the next ready task. It critically aﬀects the load balance, locality, communication cost, and task-level parallelism. In this work, we implement 4 diﬀerent distribution schemes. The ﬁrst is a balance-driven dynamic distribution scheme. When a task becomes ready, the PPE puts it into the ready queue that contains the fewest tasks. Although this scheme may produce good load balance among SPEs, it is locality oblivious, considering no data reuses across tasks.

(a) 1D-unbalance

(b) 1D-interleaving

Processors

1

2

3

(c) 2D-interleaving

4

Fig. 4. Illustration of 3 static task distribution schemes on 4 processors

The other 3 distribution schemes are static, with diﬀerent tradeoﬀ between locality and load balance. Each of the 3 schemes determines a static mapping from tasks to SPEs. All of mappings are embodied by a partition of the matrix blocks. The ﬁrst static scheme evenly partitions the matrix into 8 sub-matrices as illustrated in Figure 4 (a) (the ﬁgure uses 4 processors for illustration); the second is a 1-dimension interleaving partition as shown in Figure 4 (b); the third

68

F. Mao and X. Shen

is a 2-dimension interleaving partition shown in Figure 4 (c). Each SPE executes only the tasks corresponding to those blocks that are mapped to it. The ﬁrst static partition has the worst balance: The SPE 0 has much less job to do than the other SPEs, because the blocks in the right and bottom regions have more computations than the blocks in the left and top regions, due to the iterative computation as shown in Figure 2. The second static scheme has much better balance, and the third one has the best. On the other hand, the ﬁrst and the second scheme has better locality than the third one because a SPE in the two schemes is in charge of some whole columns and thus has more data reuse than the third scheme has. Better locality also suggests the need for fewer communications. Section 4 reports the quantitative measurement of the eﬀects of these distribution schemes. The dynamic distribution uses only the runtime status matrix to discover and distribute ready tasks. Whereas, the static distribution schemes use both the runtime status matrix and the static task mapping for task distribution. In our implementation of the static schemes, the PPE creates 8 threads, each of which dedicates itself to the task distribution for one SPE. 3.3

Locality Optimizations

For locality optimizations, we concentrate on the use of prefetch and software cache. Prefetch hides the latency in data transfer, and software cache reduces the required data fetches. Prefetch. Prefetch is an eﬀective mechanism to hide the latency in data transfer if the data accesses are predictable. In our programming models, no matter with static or dynamic task distributions, a SPE can easily predict what data it is about to use by checking the tasks in its ready queue. The prefetch in our implementation works in this way. When a SPE is about to process a task, it checks its ready queue and issues prefetch instructions for the data that the newly entered tasks may need (and not in the local store if software cache is used.) As prefetch is non-blocking, the SPE then can immediately start processing the next task in the ready queue. The capacity of the mailbox in Cell allows at most 4 tasks to be handled concurrently by the processors. So, we set the length of a ready queue to be 8. The implementation uses double-buﬀering to overlap computation and communication. As shown in Figure 5, the pointer of the target location for prefetch moves continuously in the local store to avoid the conﬂicts between the data that are used or to be used soon in the computation and the data that are newly prefetched. Software Cache. As mentioned earlier, the SPEs in Cell have no cache but only local stores. Previous work has used the local store as a software cache, such as the general-purpose software cache in the single source compiler developed by IBM [6]. In this work, we implement a software cache speciﬁc for LU

LU Decomposition on Cell Broadband Engine

69

LRU level-2 cache

Prefetch pointer

Computation pointer LFU level-1 cache

Fig. 5. Illustration of double-buﬀering for prefetch and 2-level software cache

decomposition. By tailoring itself to the data structure of the application, it is potentially more eﬃcient. In our implementation of the software cache, we use an index array to record which blocks are in the local store. Each element in the index array is a triple: (row, column, iteration), where, (row, column) identiﬁes the block in the input matrix, iteration is the iteration in which the last update to this block occurs. At a request for data, the index is ﬁrst searched; only when not found, a DMA is issued to fetch the data from the main memory. When a block is fetched into the local store, the index array is updated immediately. We use a set of counters to record the number of times each block has been accessed since the latest time it is brought into the local store. The counters are used to implement the least frequently used (LFU) replacement policy, in which, when the cache is saturated, the new block replaces the block that has the smallest counter value among all the blocks currently in the cache. To explore the beneﬁts from diﬀerent replacement policies, we also implemented a second-level cache with least recently used (LRU) replacement policy. When the second-level cache is used, the blocks evicted from the LFU cache are put into it before being completely evicted from the local store. The cache scheme is used for load operations only. For store operations, the data are directly written to the main memory via DMA to update the status matrix as soon as possible. 3.4

Other Optimizions

The SPEs in Cell B/E contain no hardware branch predictors, but support the use of branch hints. In our implementation, we insert 104 branch hints (51 often and 53 seldom) according to our understanding of the program behavior. These hints help the program speculate on branches and keep the instruction pipeline properly ﬁlled. In addition, we manually unroll the loops to gain more instruction-level parallelism and reduce loop control overhead. As these are standard optimizations, the details are skipped in this paper.

70

4

F. Mao and X. Shen

Evaluation

This section reports the eﬀects of diﬀerent task distributions and the locality optimizations. To help understand the eﬀects of the various optimizations, we use the IBM Full System Simulator to collect detailed runtime information. The simulator supports both functional simulation and cycle-accurate simulation of full systems, including the PPE, SPEs, MFCs, PPE caches, bus, and memory controller. It can simulate and capture many levels of operational details on instruction execution, cache and memory subsystem, interrupt subsystem, communications, and other important system functions [8]. We conﬁgure the simulator to simulate a Cell B/E with 8 SPEs, each of which has 256K load storage, and 2 PPEs. It runs on a Linux operating system. All the matrices used in the experiments are randomly generated; each element is a double precision ﬂoating-point number. 4.1

Single SPE Performance

This section measures the beneﬁts from program code optimizations, which include vectorization, loop unrolling, and branch hints. We use a standalone mode to measure the computation to a 4-block matrix by a single SPE. All communication overhead is ignored. There are three versions of the program in our comparison: a scalar version with pure scalar operations, a simple vector version, and an optimized vector with loop unrolling and branch hints. Table 1. SPE performance in standalone mode

Block size

Clock cycles 2x2 4x4 8x8

16x16 32x32

Instructions Issued 2x2 4x4 8x8 16x16 32x32

Scalar 20710 88596 149126 640344 3848292 9066 43332 63962 263830 1628488 Vector 23257 48131 56255 181201 1003435 11286 24080 25554 75290 412910 Opt Vector 8597 16254 15936 39248 194461 3788 7918 7296 17164 83600

Table 1 shows the total numbers of clock cycles and instructions by the three versions. When the block size is small, 2 × 2, the performance of the vector code without branch hints and unrolling is even worse than that of scalar code. It is because the block is too small to beneﬁt from the vector registers. When the block size increases, the vector code ﬁnds enough elements to vectorize. However, the loop control statements and branches limit the speedup of the vector code: The SPE has to execute them in scalar operations. The loop unrolling reduces loop controls and increases instruction-level parallelism. The branch hints help the vector code to decrease branch miss predictions and remove unnecessary control instructions. Together, they bring speedup of a factor of 2.7 to 5.2 compared to the simple vector version. The optimized vector version outperforms the scalar version by a factor of 2.4 to 19.8 as showed in Figure 6. The optimizations cause

LU Decomposition on Cell Broadband Engine

71

the code size to increase slightly, from 4K bytes in the scalar version to 4.8K in the vector version to 5.1K in the optimized vector version. The increase has negligible eﬀects to the program performance.

25 scalar

20 Speedup

vector 15

vector+hint+unroll

10 5 0 2x2

4x4

8x8

16x16

32x32

BlockSize

Fig. 6. SPU SIMD speedup on standalone mode

Instruction/Byte

900 800

Scalar

700

Vector

600

Vector Hint & Unroll

500 400 300 200 100 0 2

4

8

16

32

Block Size

Fig. 7. The number of instructions required for processing one byte of a matrix element

Figure 7 shows the eﬃciency of instructions in the three versions. The vectorized code uses far fewer instructions to process one matrix element because of the SIMD scheme provided by Cell. The unrolling and branch hints enhance the eﬃciency further by removing many scalar operations and reducing the penalty of branch miss prediction. The following sections report the beneﬁts from locality optimizations and diﬀerent task distributions. All the experiments use 32x32 as the block size, with all code optimizations enabled. 4.2

Locality Optimizations

Software cache is eﬀective in reducing the number of DMAs as shown in Figure 8 (a). The graph shows the required number of DMAs, normalized by the number when software cache is not used. As the 1D and 2D static task distribution have the best locality, the software cache reduces 45% DMA accesses. In contrast, the dynamic distribution has the worst locality, only showing 20% DMA reduction.

F. Mao and X. Shen 0.9

50

0.8

45

0.7

40 35

0.6

Clocks (M)

Normalized DMA #

72

0.5 0.4 0.3

30 25 20 15

0.2

10

0.1

5

0

0 Dynamic

2D

(a)

1D

1D-unbalanced

nPre+nCache

nPre+Cache

Pre+nCache

Pre+Cache

(b)

Fig. 8. Reduced DMAs due to software cache (a), and the impact to overall performance from prefetch and software cache (b). (“nPre”: no prefetch, “nCache”: no software cache.)

Figure 8 (b) shows how the reduction of DMA helps the overall performance. The matrix size is 256x256 and we use 2D task distribution. If prefetch is disabled, the software cache improves the performance by 43%. However, when prefetch is enabled, the beneﬁts from software cache become unnoticeable. The prefetch itself is enough to hide the latency in data transfer, improving the overall performance by 137%. This result suggests one of the interactions between diﬀerent optimization techniques: An eﬀective optimization technique, such as the software cache, becomes unnecessary when some other optimizations like prefetch is used. So by default, we enable prefetch and disable software cache in the following experiments. 4.3

Task Distribution

Diﬀerent task distribution strategies cause diﬀerent data locality and load balance. Good locality helps to reduce the required data transfers (i.e., the number of DMA operations.) In last section, Figure 8 (a) already shows the diﬀerent eﬀectiveness of software cache on diﬀerent task distributions. Figure 9 further shows the average numbers of DMA operations per task when diﬀerent task distributions are used when the block size is 32x32. When single-level cache is used only, the dynamic and the 2D interleaving distributions require much larger numbers of DMA operations than the 1D distributions. When two level cache is used, the 2D interleaving distribution requires no more DMA operations than the 1D distributions, whereas, the dynamic distribution still requires signiﬁcantly more DMA operations. This result is intuitive as the dynamic distribution is locality-oblivious, having the worst locality among all the distributions. On the other hand, the dynamic distribution and the 2D interleaving distribution have the best load balance. The poor load balance in the two 1D distributions causes up to orders of magnitude performance degradation compared to the performance by the other two distributions, even though they have better locality.

LU Decomposition on Cell Broadband Engine

73

Fig. 9. Average numbers of DMA operations per task

Besides load balance, the dynamic distribution has another advantage: It allows the full freedom for SPEs to explore task-level parallelism. While in the static distributions, a ready task may have to wait for a particular SPE to get free, even though some other SPEs may be idle. This advantage comes at the sacriﬁce of data locality. However, as shown in Section 3.3, prefetch is enough to hide data transfer latencies for LU decomposition. Therefore, the two advantages of the dynamic distribution make it a more appealing choice than the static distributions. The comparison between the two graphs in Figure 10 verﬁes the above projection. The ﬁgures show the performance of solving a 128 × 128 matrix for various block sizes by using the 2D-interleaving distribution and the dynamic distribution. The three bars for a block size show the range and the mean of the performance of the 8 SPEs. The dynamic distribution shows smaller diﬀerence between the minimum and maximum clocks when the block size is 16 and 32 than the static distribution. It indicates better balance. The relatively larger diﬀerence when the block size is 8 is likely due to the randomness in dynamic distribution. Overall, the dynamic distribution outperforms the static distribution by a factor of 1 to 7 in terms of the maximum clocks.

(a) 2D interleaving distribution

(b) dynamic distribution

Fig. 10. Overall running time with diﬀerent task distributions

5

Related Work

As an important linear algebra kernel, LU decomposition has been studied extensively on traditional parallel computers. In 1993, Buoni et al. have studied the

74

F. Mao and X. Shen

diﬀerent algorithms for static LU decomposition on traditional shared memory processors [4]. A scalable implementation is included in some widely used linear algebra packages, such as LAPACK [1] and ScaLAPACK [3]. However, the implementations assume the computing systems to be either a homogeneous parallel computer or a homogeneous network of workstations. Beaumont et al. studied the matrix partition schemes for LU decomposition on heterogeneous networks of workstations [2], rather than Cell B/E. We note that this current work is not the ﬁrst work that implements LU decomposition on Cell B/E. But the existing implementations, such as the one in the Cell SDK library [9], although working well on a single SPE, have not systematically explored the design space and optimization issues for multiple SPEs. This work is not aimed to produce a universally applicable, fastest LU decomposition, but to use LU decomposition as an example problem to reveal the interactions among diﬀerent optimizations on Cell B/E and obtain the insights in holistic optimizations for heterogeneous multicore architecture. Locality optimization has been a focus in many previous studies, especially on traditional CPU and modern multicores [13,10]. For LU decomposition, an example is the automatic blocking for improving its locality on SMP [12]. On Cell B/E, the IBM project of Single Source Compiler (SSC Research Compiler) [6, 5] has included interesting explorations to locality optimizations. These explorations are particularly for automatic conversion of general OpenMP programs to Cell programs; the performance of the generated programs is often not as good as that of the programs directly developed from Cell SDK.

6

Conclusions

In this paper, we present an exploration to tailor block LU decomposition to Cell Broadband Engine processors. The implementation exploits diﬀerent levels of parallelism supported by Cell. It exposes parallel tasks through a status matrix, leverages instruction-level parallelism by manual vectorization, and enables parallel communications by the use of non-blocking mailboxes. We study the effects of diﬀerent task distribution strategies and a set of locality optimizations. The exploration reveals the interactions between those optimizations, and oﬀers some insights into the optimization on heterogenous multi-core processors, such as the selection of programming models, considerations in task distribution, and holistic perspective required in optimizations. Acknowledgments. We thank Dimitrios Nikolopoulos, Zhengyu Wu, and Stephen McCamant for their help. This material is based upon work supported by the National Science Foundation under Grant No. 0720499 and 0811791. Any opinions, ﬁndings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reﬂect the views of the National Science Foundation.

LU Decomposition on Cell Broadband Engine

75

References 1. Angerson, E., Bai, Z., Dongarra, J., Greenbaum, A., McKenney, A., Du Croz, J., Hammarling, S., Demmel, J., Bischof, C., Sorensen, D.: LAPACK: A portable linear algebra library for high-performance computers. IEEE Supercomputing, 2–11 (1990) 2. Beaumont, O., Legrand, A., Rastello, F., Robert, Y.: Static LU decomposition on heterogeneous platforms. The International Journal of High Performance Computing Applications 15(3), 310–323 (Fall 2001) 3. Blackford, L.S., Choi, J., Cleary, A., D’Azeuedo, E., Demmel, J., Dhillon, I., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.C.: ScaLAPACK user’s guide, Society for Industrial and Applied Mathematics, Philadelphia (1997) 4. Buoni, J.J., Farrell, P.A., Ruttan, A.: Algorithms for lu decomposition on a shared memory multiprocessor. Parallel Comput. 19(8), 925–937 (1993) 5. Chen, T., Zhang, T., Sura, Z., Tallada, M.G.: Prefetching irregular references for software cache on cell. In: CGO, pp. 155–164 (2008) 6. Eichenberger, A.E., et al.: Using advanced compiler technology to exploit the performance of the cell broadband enginetm architecture. IBM Syst. J. 45(1), 59–84 (2006) 7. Pham, D., et al.: The design and implementation of a ﬁrst-generation cell processor. In: Proceedings of the IEEE International Solid-State Circuits Conference, ISSCC (2005) 8. IBM. Cell be programming tutorial, http://www-01.ibm.com/chips/techlib/techlib.nsf/ products/Cell Broadband Engine 9. IBM. Cell broadband engine sdk libraries v3.0 (2008), http://www.ibm.com/developerworks/power/cell 10. Jiang, Y., Zhang, E., Tian, K., Shen, X.: Is reuse distance applicable to data locality analysis on chip multiprocessors? In: Proceedings of the International Conference on Compiler Construction (2010) 11. Kahle, J.A., Day, M.N., Hofstee, H.P., Johns, C.R., Maeurer, T.R., Shippy, D.: Introduction to the cell multiprocessor. IBM J. Res. Dev. 49(4/5), 589–604 (2005) 12. Yi, Q., Kennedy, K., You, H., Seymour, K., Dongarra, J.: Automatic blocking of qr and lu factorizations for locality. In: MSP 2004: Proceedings of the 2004 Workshop on Memory System Performance, pp. 12–22. ACM, New York (2004) 13. Zhang, E.Z., Jiang, Y., Shen, X.: Does cache sharing on modern cmp matter to the performance of contemporary multithreaded programs? In: PPoPP 2010: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 203–212 (2010)

FDTM: Block Level Data Migration Policy in Tiered Storage System Xiaonan Zhao, Zhanhuai Li, and Leijie Zeng School of Computer Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China {zhaoxn,lizhh,zenglj}@nwpu.edu.cn

Abstract. ILM and tiered storage system are designed to address the challenge of achieving balance between cost and storage performance. However, both of them are hard to implement fully automatic data migration by traditional solutions for which are mainly relying on administrators experience and need huge manual work for data migration according to storage conﬁguration and IO access patterns. This paper proposes a novel bi-directional migration policy FDTM based on blocklevel data valuation and fully automatic migration process. FDTM aims to get a trade-oﬀ between storage QoS and migration costs by introducing double thresholds to narrow the migration scope of block-level data objects. Experiment and analysis show that FDTM is eﬃcient at blocklevel data migration comparing with traditional migration policies. In addition, it could help pave the way to implement tiered storage system with fully automatic data migration. Keywords: Data migration policy; data valuation; feedback; Tiered Storage System.

1

Introduction

Industry research has shown that 70-80% of all storage data is inactive, and data is rarely accessed after 30-90 days[11]. So in business perspective, the data value is declining over time dramatically. SNIA proposes ILM to address this issue[18]. In addition tiered storage as “tiering ILM”, is the key component in an ILM practice, but not all. Generally, there are three types of storage in tiered storage system: online, near-line and oﬀ-line [19], in which data is classiﬁed into distinct classes and stores in diﬀerent tier separately with considering performance, availability, recoverability, safety and other requirements. A tiered storage system is usually composed of expensive FC disk arrays for high performance accessing, cheaper SATA storage subsystems for data staging and high-capacity robotic tape libraries for data archiving, which could decrease the system’s TCO while keeping the storage performance at the same time. Fig. 1 shows the infrastructure of a tiered storage system. In a tiered storage management system, data classiﬁcation, data placement and data migration are the core function components. Furthermore data migration is also helpful to optimize storage system performance, which includes C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 76–90, 2010. c IFIP International Federation for Information Processing 2010

FDTM: Block Level Data Migration Policy in Tiered Storage System

Applications Definition Business process

Requirement

77

Requirement Management object Policy

Data Parameters

Data Migration

Data placement Data valuation data

Execution

Data management & optimize system service

Storage devices Management

Tired storage system

Online

Nearline

Offline

Fig. 1. Infrastructure of Tiered Storage System

single-level migration and multi-level migration. The main goal of single-level migration is to balance workload, however multi-level is more common in tiered storage system in order to manage the system in the form of ILM by reducing the TCO of the system with QoS guaranteed as while. It’s clearly that data migration policy aﬀects performance, ﬂexibility and other properties of the whole system. Today, much researches on migration focus on load balance of distributed system or cluster system [17], and because of the obvious diﬀerence on performance, capacity and other aspects among storage devices, data migration is more complex in tiered storage system. However traditional migration policies took little consideration about robustness. Moreover, the trigger of migration today is heuristic and coarse-grained, and almost all of them pay too much attention to one way migration, to migrate from high performance storage to low performance storage, without taking enough consideration of down to up migration. Based on our study of traditional policies for automated tiered storage, this paper proposes a novel bi-directional migration policy FDTM(Feedback based Double Threshold data Migration) at block level. It is fully self-adaptive, deals with the down to up migration carefully as well as up to down migration, and could avoid unnecessary data migration under “false overload” situation by setting of double thresholds and spill over interval.

2

Data Placement by Block-Level Data Valuation

Data classiﬁcation is a precondition for data migration, and migration policy is usually based on data classiﬁcation and placement. In addition data valuation is

78

X. Zhao, Z. Li, and L. Zeng

the core of data classiﬁcation. In this paper, FDTM considers the speciﬁcation of storage system and the characteristics of data objects, to get data objects value based on block-level evaluation, which is used to evaluate the importance of the data in a tiered storage system. Obviously, it is not hard to achieve couple of metrics related to data value at ﬁle-level, because metadata contains all the necessary information. However, in storage subsystem at data center, most data are stored and managed at blocklevel, within one data block there maybe couple of ﬁles, so value metric is hard to achieve. But data blocks still have some attributes which are helpful for blocklevel data valuation: 1. READ FREQUENCY. Read times in given interval to a certain data block. 2. WRITE FREQUENCY. Write times in given interval to a certain data block. Usually, it has diﬀerent performance in write and read within same storage device, and another fact is that two storage devices maybe have nearly same read performance but have huge diﬀerence in write performance. Ri and Wi represent the read/write frequency at the ith time-step. 3. R/W GRANULARITY. The granularity represents the data amount ratio which is related to an IO (whether read or write) to a ﬁxed size data block. Pi is the average R/W granularity in the ith time-step. 4. DATA DISTRIBUTION. It is the location information of accessed data, measured by statistical result which comes from all the operations in data blocks in a given interval. Di means the data placement value at the ith time-step. 5. RELEVANCE BETWEEN BLOCKS. Similar to relevance between ﬁles [5], it means if the IO operations on a data block is similar to the IO operations on another block in a given time interval, in other words these two blocks are associated, then the data value of these blocks is also considered as being correlated. MIf there are M blocks relate to block d within the same storage device, j=1 RLdj represents block relevance of d and other M blocks. And the association coeﬃcient RLdj which is indicated association degree between data block d and j is deﬁned as equation 1: Ed · Ej (1) |Ed ||Ej | N N 2 in which, Ed · Ej = i=1 qdi qji , |Ed | = i=1 qdi here, T means the valuation time interval, and it is divided into N time-steps equally, each time-step’s length is ti − ti−1 = L. If t indicates the current time, then T is the time period of [t − N × L, t]. Ed (qd1 , qd2 , . . . qdN ) is an IO vector during T of data block d (Ej is IO vector of data block j as well ), it is used to record R/W times information of d during T , and component qdi is R/W times in the ith (i = 1, 2 . . . N ) time-step. as following, the data blocks valuation could be calculated by equation 2: M N 1 + j=1 RLdj Vt (d) = (λw Wi + λr Ri )Pi Di (2) N i=1 RLdj =

FDTM: Block Level Data Migration Policy in Tiered Storage System

79

Vt (d) is the value of data block d at time t. In which Ri and Wi represent the read/write frequency at the ith time-step [t−i×L, t−(i−1)×L] in the valuation interval [t−N ×L, t] respectively; Pi is the average R/W size at the ith time-step; Di is the data placement value at the ith time-step. If there are M data blocks relate to block d within the same storage device, then sum all M factors as one of valuation factor for block d. λr and λw are the compensation coeﬃcients for read and write respectively, which are used to measure the time cost diﬀerence for the R/W operation between two kinds of storage (mainly include time cost ratio in accessing the same size of data with same times of R/W operations). For more detail about data valuation, please refer to[25].

3

Starting Conditions for Data Migration

In this section, we will discuss when migration should start. In a tiered storage system, the data migration process got to consider many preconditions in the management system for special requirements. Generally, there are four main trigger conditions need be taken into account for the applications. 1. FREE SPACE OF STORAGE DEVICES This is the spare capacity threshold of storage devices to keep the speciﬁc application running well. 2. DIFFERENCE OF DATA VALUES There are two attributes to determine the value of data. One is the characteristic of the application itself, such as requirements of security, reliability of the system, so some data are more important than others inherently sometimes, the other attribute is the data value itself which is declining as time goes by. 3. UTILIZATION RATE OF DATA It depends on the data class. Some class is declining when it is created; some class is increasing at ﬁrst and declining after it reaches the peak utilization rate of itself, and utilization rate of another class changes periodically. 4. RULES DEFINED BY LAW There are some rules deﬁned by law in data management. Such as call history of cell phone for billing would only be reserved for several months, people could forecast the trend of some data changing and so on. Moreover, these conditions are related to each other. So in order to get a more reasonable migration policy, it is necessary to consider these conditions comprehensively. At current, there are two well-used basic polices[24]: 1. FIXED THRESHOLD WITH ACCESS FREQUENCY It has both up-migration and down-migration, and setting min and max thresholds for data access frequency. Migration will be triggered when reaching speciﬁc threshold. 2. HIGH-LOW WATER LEVEL OF STORAGE Thresholds are setting according to the capacity usage ratio. When the actual capacity usage ratio exceeds the threshold, the system will migrate some data according to their access frequencies.

80

X. Zhao, Z. Li, and L. Zeng

In above policies there are couple of demerits: although it’s an up-down migration, it will become time-consuming badly when the system has huge amount of data objects. Moreover, if a system with average low workload is changing dramatically at capacity ratio (called “false overload”), it will trigger migration operations many times and maybe introduces oscillation migrations.

4

Feedback-Based Double Threshold Data Migration

Data migration policy got to address four questions that are known as when, where, what and how much, by which to determine the migration candidates, migration quantity, and migration target. Migration target is obvious in tiered storage system, either is higher performance storage or lower performance storage. So the migration candidates and the quantity are key elements. Based on the summary introduction in early paper [26], an extended discussion of this policy will be given out completely in following sections. Other 3 migration situations will be proposed, however we just discuss two-tier storage structure for better description and understanding. 4.1

Parameters Deﬁnition and Initialization

Suppose the deﬁned capacity usage ratios are same across diﬀerent storage tiers. And using letters h and l at subscript of parameters to represent high performance storage and low performance storage. If CT = Ch + Cl represents total capacity and Ch is the capacity of high storage, Cl is the capacity of low storage, then high storage’s usage ratio is HC = Ch /CT , and low storage’s usage ratio is LC = Cl /CT . Actual data size at each tier has 4 thresholds: base-high, base-low thresholds, limit-high and limit-low thresholds, they will be described in detail later. Here, β represents high threshold and α represents low threshold, superscript 0 means “base” thresholds, superscripts max and min represent “limit” thresholds. We use two queues, the up-migration queue and the down-migration queue which are represented by two vectors Qu and Qd respectively for each tier. The elements in queue vectors are metadata of each migration candidates. The information about data value will be collected periodically after the system startup, and the interval is same with data valuation but without any calculation of data value until the enqueue condition was triggered. 4.2

When to Migration

An ideal storage management system should keep online for applications during data migration, and the storage performance should not be impacted badly at any time. So, the migration timing depends on bandwidth, migration speed and migration time window. The D-value of high and low thresholds δ should

FDTM: Block Level Data Migration Policy in Tiered Storage System

81

be the function of E(t) and g(Bi ). E(t) is the function of time cost of data migration, and g(Bi ) represents the migration speed, Bi is the bandwidth for data migration. In other words, D-value δ is a function of expected time cost and system migration bandwidth: δ → f (g(Bi ), E(t)). The system could be regarded as stable, if high storage data size βh (t) at time t is not bigger than βh0 , in which βh0 = βT × HC , βT (CT > βT ) is the total data size of storage devices, and the limit-high threshold is βhmax = Kβh0

(3)

K = min(γ, 0.5 + 0.5Ch /βh0 ),

(4) HC

in equation 4, γ is the ratio of actual data capacity usage ratio and the original HC when the high storage provides same QoS with low storage. And when the ratio is much larger than 2, a ratio which can reach the right mid-point of βh0 and Ch should be used. As to the base-low threshold α0h = βh0 − δ of high performance storage, the = α0h × LC . The 4 thresholds for low limit-low threshold is deﬁned as αmin h performance storage could be derived with similar approach with considering two-tiered storage, δ and K are deﬁned as following:

4.3

δ = δ × Cl /Ch

(5)

K = min(γ , 0.5 + 0.5Cl /βl0 )

(6)

Migration

FDTM has up-migration and down-migration conditions for both high storage and low storage. The up-migration means data migrating from low storage to high storage, which could be passively migration triggered by the actual data size on low storage exceeds its base-high threshold βl0 , or be proactive migration triggered by the actual data size on high storage lower than its base-low threshold α0h . In reverse, the down-migration is data migrating from high to low performance storage, which could be triggered by the actual data size on high storage exceeds βh0 , or be triggered by the actual size on lower than α0l in low storage. Storage device speciﬁcation determines the timing of enqueue and migration, to answer the “when” question, and the characteristics of data objects determine data valuation. If the base thresholds were triggered, then each storage tier starts to enqueue migration candidates into Qu and Qd based on data blocks’ value, which answers the “what” and “how much” questions. Adjusting the entry queue conditions by each migration operation, can improve the queue length (or the migration quantity), enhance the migration policy eﬃciency, and keep the stability of the tiered storage.

82

X. Zhao, Z. Li, and L. Zeng

A1

β h0

A

t1 t2 t0

B

β hm a x

C

β h0

B

B

β hm a x

β h0

A

t m tn

t0

t

tm

Case1：timing from t0 to tm after from t1 to t2， migrate

t0

t

Case3: timing from t0 to tn , migrate

A

tm

t

tn

Case5: timing from t0 to tn , migrate

β

β

β hm ax

β h0

β

β

β

β hm a x

A

t0

tm

B

β hmax

C

β h0

t

β

B

A

t0

B

βhmax

C

tm

tn

Case2：timing from t0 to tm ， not Case4：timing from t0 to tn ， not migrate, stop monitoring migrate，go on monitoring

βh0

t

A

t0

C

tm

tn

t

Case6: timing from t0 to tn，not migrate if area less than S△ABC

Fig. 2. Condition cases for trigger migration. The area formed by three points A,B,C and whether the special case need to be migrated according to the conditions discussed in algorithm 2.

Situation 1: Down-migration for Exceeds βh0 . There are several input parameters used in the algorithm 2. L, checkpoint interval during monitoring; t, checkpoint; βh (t), actual data size d of high storage at time t; Adh , data amount in down queue of high storage; Vlast , d u data value of the last data object in Qh ; Vmin , data value of the min data t object in up queue of low storage; C1 , represents expression t0n (βh (t) − βh0 )dt ≥ 1 (t − t0 )(βhmax − βh0 ); Dmig , data amount has been migrated till current time. 2 n In fact, the determined conditions of FDTM are based on the ﬁtting of a linear function to the trend of data amount in the tiered storage. When the growth trend of the curve is less than the slope of the linear function, also known as the curve integral is less than the area of the triangle (SABC ) in the determined condition (the area composed by the slash and dash between two time points and y = βh0 , as shown in case 4 of the Fig.2.), we regard the system as stable and doesn’t need migration. Or else, the data amount growth could be considered as linear growth and the data migration should be implemented at once to reduce the data amount in high storage. It is the passive down-migration in high storage. Situation 2: Up-migration Exceeds βl0 . Situation 2 and situation 1 are similar, but the length of up queue is limited by δ instead of δ which is related to low storage, for the usage stability of high storage is the precondition of up-migration, data amount of migration is determined by high storage’s capacity.

FDTM: Block Level Data Migration Policy in Tiered Storage System

83

Algorithm 1. ImpMigration(Qdh ) 1: 2: 3: 4: 5:

while (Dmig < δ)and(βl (t) < βl0 ) do migration data objects in Qdh end while d record last data value migrated by Vmax ; d return Vmax ;

At time t0 and βl (t0 ) = βl0 , start timing and enqueue data objects into Qul according to their value descending, the length of Qdh is determined by the amount of data objects in it, it should be less than δ or the value of last data object in d . Continue to monitor the system at interval L, and do steps Qdh less than Vmax similar to situation 1, the diﬀerences are as following: All the conditions which related to the high storage in situation 1 (such as βh0 , βhmax , βh (t)) will be replaced by the corresponding ones of the low storage (βl0 , βlmax , βl (t)) here, and vice versa; The inequation used to judge whether the tiered system needs to migrate is changed at the right side: give a value of LC as an coeﬃcient when calculate the area of triangle, because with the same data changing rate, the performance of low storage is lower; u , Furthermore, the min data value in up-migration should be recorded as Vmin which is used for a new migration cycle after migration. Situation 3: Up-migration Below α0h . Moreover, to make a complete migration policy to achieve higher storage eﬃciency it should implement migration when the actual data size of high storage or low storage is less than base-low threshold, in situation 3 and situation 4, we’ll discuss it in detail. At time t0 , when high storage capacity usage size declines to less than α0h , start timing and enqueue block-level data objects into the queue Qul in low storage, data amount in Qul should not exceed δ or include all the active data objects in low storage, and do as following steps: , goto 1. If it raises above α0h again before reaching the limit-low threshold αmin h step 3, else continue timing until the actual data size drops to αmin , then h record the current timing as tm , and goto step 2; 2. Implement up-migration until all the data objects in Qul are migrated, and goto step 3; 3. Stop timing and enqueuing monitoring. Waiting for next condition trigger αh (t) = α0h , then goto step 1 for next cycle. Situation 4: Down-migration Below α0l . At time t0 , the data amount in low storage declines to the base-low threshold α0l and at the same time the data in high storage exceeds base-high threshold βh0 , then start timing and enqueuing block-level data objects into the queue Qdh in

84

X. Zhao, Z. Li, and L. Zeng

Algorithm 2. Down Migration for exceeds βh0 1: while 1 do 2: keep monitoring the tiered storage system; 3: if βh (t) = βh0 then 4: t0 ⇐ t; start to enqueue Qdh by block-level data value, length limited by Adh < δ d u or Vlast < Vmin 5: end if 6: while βh0 ≤ βh (t) do 7: if βh (t) = βhmax then 8: tm ⇐ t; t 9: if t0m (βh (t) − βh0 )dt ≥ 12 (tm − t0 )(βhmax − βh0 ) then 10: ImpMigration(Qdh ); 11: break; 12: else 13: t = t + L; 14: if βh (t) > βhmax then 15: ImpMigration(Qdh ); 16: break; 17: else 18: for t; t − tm ≤ tm − t0 ; t + L do 19: tn ⇐ t 20: if (βh (t) ≥ βhmax )or((tn − tm = tm − t0 orβh (t) ≤ βh0 )and t ( t0n (βh (t) − βh0 )dt ≥ 12 (tn − t0 )(βhmax − βh0 ))) then 21: ImpMigration(Qdh ); 22: break; 23: end if 24: end for 25: end if 26: end if 27: end if 28: end while 29: stop timing and enqueuing; 30: end while

high storage, data amount in Qdh should not exceed (βh (t) − βh0 ) · LC or δ (take the smaller one from them), and do as following steps: , goto 1. If it raises above α0l again before reaches the limit-low threshold αmin l step 3, else continue timing until the data amount drops to αmin , then record l the current timing as tm , and goto step 2; 2. Implement down-migration until all the data objects in queue Qdh are ﬁnished migrating or the actual data size of low storage reaches βT · LC , and goto step 3; 3. Stop timing and enqueuing monitoring. Waiting for next condition triggers αl (t) = α0l , then goto step 1 for next cycle.

FDTM: Block Level Data Migration Policy in Tiered Storage System

85

It’s obviously that situation 1 and 2 would happen in higher probability than situation 3 and 4, if there are a reasonable data placement method in the tiered storage system which is out of discussion in this paper. 4.4

Discussion on Better Migration Eﬀect

FDTM introduces feedback mechanism to improve its thresholds parameters for the parameters used at beginning maybe not very appropriate. So the parameters of migration policy could be more reasonable and the application could get more performance beneﬁt with time going by. d which indicates the length of down-migration For example, the value of Vmax d queue Qh of high storage, could be adjusted by its history values. Here, a simple method is introduced: Fi , (i = 1 . . . n, n is the migration times) represents the last block-level data object’s value in the ith migration queue Qdh , after the ﬁrst d d 2 migration Vmax = F1 , after the second migration, V should be F1 +F . . . and 2 max n

F

i d after the nth migration, the adjusting is Vmax = i=1 , which is the average n value of Fi in n times of migrations. Certainly, with considering the diﬀerent importance of its history value, we can get some diﬀerent adjusting method, such as equation 7

d (n + 1) = Vmax

n i=1

d θi Vmax (i),

n ( θi = 1)

(7)

i=1

d . here, θ represents the importance of history values of Vmax u At the same time, if Vmin is up-migration enqueue condition, after the nth time of migration, it should be adjusted by all values of the last data objects in the queues at each migration. In order to make FDTM more robust and eﬃcient, the other parameters or thresholds could be adjusted according to real migration u monitoring and feedback, such as Vmin , α, β, δ and etc.

4.5

Experiment

In this section, we validate and analyze the FDTM policy with DiskSim simulator [3] for we dont have a real tiered storage environment. In addition we choose Seagate Cheetah 15.5K FC disk and SSD disk to compose a two tiered storage system within DiskSim. The Seagate Cheetah 15.5K FC disk speciﬁcation is extracted by DIXtrac disk characterization tool, and in other side the SSD DiskSim extension by Microsoft research is an idealized SSD that is parameterized by the properties of NAND ﬂash chips instead of a simulator for any speciﬁc SSD. We show the device capabilities diﬀerence by dollar cost in Fig. 3 according to the latest research on the tradeoﬀs between FC disk and SSD disk [13] for better understanding of the storage devices in the experiment environment. It’s obviously that SSD disk could provide better IO performance, which could be around 100 times [13], than FC disks under same workload but cost more dollars. Obviously in enterprise data center it’s very worthy to adopt tiered storage

86

X. Zhao, Z. Li, and L. Zeng Table 1. Storage Device Characters

Tier 1 storage

Tier 2 storage

SSD Disk SSD extension for DiskSim by Microsoft research Size 32 GB Block Size 256 KB Block Number 131072 The SSD is composed by 8 elements, each el ement has 8 planes, and each plane has 2048 blocks FC Disk Seagate Cheetach 15.5K FC Disk Size 146.8 GB Block Size 512 B Block Number 28679487

,236

*%

0HPRULJKW05 66' 6HDJDWH&KHHWDK

0%V Fig. 3. Device capabilities normalized by dollar cost

infrastructure by FC disks and SSD disks, to provide improved IO performance with reasonable storage cost. Tab. 1 is the basic conﬁguration for this simulated environment. The workload IO trace for experiment is randomly generated by using DiskSim at start, moreover, in order to play the same workload trace on four kind of storage systems (the SSD disk, the FC disk and the tiered storage system) respectively, so the block access range in this trace is limited by the size of the SSD disk, which is the smallest in capacity. Besides, we have two assumptions at here: Assumption 1: We assume that the initial data placement has little performance impact on experiment results, so we randomly place the data block objects across the tiered storage system to avoid the initial data placement issue. Assumption 2: We assume that the overhead of data block objects migration between SSD and FC disks could be ignored for the experiment, for it’s hard to simulate the overhead of data migration in this simulated environment and we’ll pay more attention on policy itself. So we only change the IO request target devices between SSD and FC disks in the workload trace to simulate the environment with data migration, in fact the data migration doesn’t happen in the experiment.

Average Response Time (ms)

FDTM: Block Level Data Migration Policy in Tiered Storage System

9 8 7 6 5 4 3 2 1 0

87

SSD FC SSD+FC TS-with-FDTM

Read-ART

Write-ART

IO-ART

Fig. 4. Average response time of 4 kinds of storage system

70000

SSD FC SSD+FC TS-with-FDTM

I/O Request

60000 50000 40000 30000 20000 10000 0 <

5

< 10

< 20

< 40

< 60

< 90

<120

<150

<200

200+

I/O Response Time (ms)

Fig. 5. Overall IO system response time distribution of the experiment

In experiment the workload trace is played 4 rounds on the 4 kinds of simulated storage devices, which are SSD disk, FC disk, tiered storage composed by SSD and FC disks, and tiered storage composed by SSD and FC disks with the FDTM policy control. The experiment results are collected from the DiskSim outputs. Fig. 4 shows the average response time results of the experiment. We also give out overall IO system response time distribution in Fig. 5 It could be concluded from the experiment results that the tiered storage system with FDTM policy could provide better IO performance and it’s easy to be implemented in current storage system. However, we just valuated FDTM with a simulated environment and synthetic workload trace, in which we take little consideration on the overhead of FDTM, so this experiment is just used to validate FDTM policy and we aim to valuate it in real environment with ongoing work.

88

5

X. Zhao, Z. Li, and L. Zeng

Related Work

As one of the most important concepts, DMF(Data Management Forum) in SNIA promotes ILM as a special topic in the ﬁeld of data management system research and application. In [14,20], an deep analysis and discussion was done to the relationship between ILM and tiered storage system. There are several systems providing tiered storage system management functions transparently, such as SGI InﬁniteStorage Data Migration Facility [19], VERITAS’ NetBackup Storage Migrator [1] and IBM’s HPSS[23]. Additionally, early research mainly focuses on the management and employment of tertiary storage, an oﬄine storage tier in the tiered storage system, such as the applications in VoD(Video on Demand) system and DBMS[12,15], researches [6,16] mainly focus on the problem of performance in tertiary storage. With the development of tiered storage applications, studies on data classiﬁcation and data placement are more popular, in [21] the necessity and importance in data classiﬁcation and its basis are discussed, and [4,18] give out some speciﬁc methods for classiﬁcation. References [8,9] focused on data placement under diﬀerent storage infrastructure. There are also a lot of researches on data migration, such as to implement online migration by leveraging control theory [10] to reduce the overhead mostly, to describe the migration algorithm with edge-coloring theory [2,7]. All of above researches aim to achieve workload balance under a homogeneous storage environment, and suppose that data objects are independent with each other. However, it is almost impossible that happens in real environment. [22] researches on block-level data migration, but it just takes data access frequency into consideration, which aﬀects its accuracy of candidate migration objects.

6

Conclusion and Future Work

This paper proposes the novel bi-directional block-level data migration policy FDTM for tiered storage system, the migration with double thresholds based on feedback, which could help improve the IO performance of tiered storage system. In addition, we also give out the detail process of block-level data migration, and how to select the migrated candidates, in which the data valuation mechanism could narrow the candidates’ scope and avoid the oscillation migrations by leveraging the relevance between diﬀerent data blocks. The result of experiment shows that the tiered storage system with FDTM policy could provide better IO performance, and it’s easy to be implemented in current storage systems. As ongoing and future work, we’ll focus on the study of data placement, the interaction between data placement and data migration in tiered storage system. Acknowledgments. This work was supported by a grant from the National High Technology Research and Development Program of China (863 Program) (No. 2009AA01A404).

FDTM: Block Level Data Migration Policy in Tiered Storage System

89

References 1. What is storage virtualization. Veritus white paper, http://file.doit.com.cn/upfiles/2006/1027/0_230404_f1.pdf 2. Anderson, E., Hall, J., Hartline, J.D., Hobbs, M., Karlin, A.R., et al.: An experimental study of data migration algorithms. In: Proceedings of the 5th International Workshop on Algorithm Engineering, pp. 145–158. Springer, Heidelberg (2001) 3. Bucy, J.S., Schindler, J., Schlosser, S.W., Ganger, G.R.: The disksim simulation environment version 4.0 reference manual. Technical report cmu-pdl-08-101, carnegie mellon university (2008) 4. Golubchik, L., Khanna, S., Khuller, S., Thurimella, R., Zhu, A.: Approximation algorithms for data placement on parallel disks (2000) 5. Jin, H., Xiong, M., Wu, S.: Information value evaluation model for ilm. In: Proceedings of the 2008 Ninth ACIS International Conference on Software Engineering, Artiﬁcial Intelligence, Networking, and parallel/Distributed Computing, pp. 543–548. IEEE Computer Society Press, Los Alamitos (2008) 6. Johnson, T., Miller, E.L.: Performance measurements of tertiary storage devices. In: Proceedings of the 24rd International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., San Francisco (1998) 7. Khuller, S., Kim, Y.A., Wan, Y.C.: Algorithms for data migration with cloning. In: Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 27–36. ACM, San Diego (2003) 8. Li, C., Zhou, L.Z., Xing, C.X.: A cost model for data placement and access path selection problem in fc-san. Journal of Software (05) (2004) 9. Li, J.T., Prabhakar, S.: Data placement for tertiary storage. In: Proceeding of the 10th NASA Goddard Conference on Mass Storage Systems and Technologies/19th IEEE Symposium on Mass Storage Systems (MSS 2002), Adelphi, Maryland, USA, pp. 193–207 (2002) 10. Lu, C., Alvarez, G.A., Wilkes, J.: Aqueduct: Online data migration with performance guarantees. In: Proceeding of the USENIX Conference on File and Storage Technologies (FAST), Monterey, pp. 219–230 (2002) 11. Massiglia, P.: Exploiting multi-tier ﬁle storage eﬀectively. Snia tutorial, SNIA (2009), http://www.snia.org/education/tutorials/2009/spring/file/ paulmassiglia exploiting multi-tier file storagev05.pdf 12. Myllymaki, J., Livny, M.: Disk-tape joins: Synchronizing disk and tape access. In: Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, Ottawa, Ontario, Canada, pp. 279–290. ACM Press, New York (1995) 13. Narayanan, D., Thereska, E., Donnelly, A., Elnikety, S., Rowstron, A.: Migrating server storage to ssds: analysis of tradeoﬀs. In: EuroSys 2009: Proceedings of the 4th ACM European Conference on Computer Systems, pp. 145–158. ACM Press, New York (2009) 14. Peterson, M.: Ilm and tiered storage. Snia tutorial (January 2006), http://www.snia.org/forums/dmf/knowledge/ dmf-ilm and tiered storage 20060221.pdf 15. Prabhakar, S., Agrawal, D., Abbadi, A.E., Singh, A.: A brief survey of tertiary storage systems and research. In: Proceedings of the 1997 ACM Symposium on Applied Computing, pp. 155–157. ACM, San Jose (1997) 16. Reiner, B., Hahn, K.: Optimized management of large-scale data sets stored onttertiary storage systems. IEEE, Distributed Systems Online 5(5) (2004)

90

X. Zhao, Z. Li, and L. Zeng

17. Seo, B., Zimmermann, R.: Eﬃcient disk replacement and data migration algorithms for large disk subsystems. ACM Transactions on Storage 1(3), 316–345 (2005) 18. Shah, G., Voruganti, K., Shivam, P., del Mar Alvarez Rohena, M.: Ace: Classiﬁcation for information lifecycle management. NASA Mass Storage Systems and Technologies (2006) 19. Shepard, L.: Sgi inﬁnitestorage data migration facility(dmf) a new frontier in date lifecycle management. White paper, sgi, http://www.sgi.com/pdfs/3631.pdf 20. SNIA: Ilm deﬁnition and scope an ilm framework (July 2004), http://www.snia.org/forums/dmf/programs/ilmi/dmf-ilm-vision2.4.pdf 21. SUN: Best practices in data classiﬁcation for information lifecycle management. Sun white paper, http://www.sun.com/storage/white-papers/ best practices data classification ilm.pdf 22. Wang, D., Shu, J.W., Xue, W., Shen, M.M.: Self-adaptive hierachical storage management in san based on block-level. Chinese High Technology Letters (02) (2007) 23. Watson, R.W.: High performance storage system scalability: Architecture, implementation and experience. In: Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies, pp. 145–159. IEEE Computer Society, Los Alamitos (2005) 24. Xu, N.: A frequency-based self-adaptive data hierarchy policy. SCIENCE and TECHNOLOGY ASSOCIATION FORUM (03) (2009) 25. Zhao, X.N., Li, Z.H., Zeng, L.J.: A hierarchical storage strategy based on blocklevel data valuation. In: The Proceeding of the Fourth International Conference on Networked Computing and Advanced Information Management 2008 NCM 2008, Korea, pp. 36–41. IEEE, Korea (2008) 26. Zhao, X., Li, Z., Zhang, X., Zeng, L.: Block level data migration in tiered storage. In: Proceeding of International Conference on Computer and Network Technology (ICCNT 2010). IEEE Computer Society Press, Bangkok (2010)

Scale-Adaptable Recrawl Strategies for DHT-Based Distributed Web Crawling System Xiao Xu, Weizhe Zhang, Hongli Zhang, and Binxing Fang School of Computer Science and Technology, Harbin Institute of Technology, Harbin China [email protected], {zwz,zhl}@pact518.hit.edu.cn, [email protected]

Abstract. Large scale distributed Web crawling system using voluntarily contributed personal computing resources allows small companies to build their own search engines with very low cost. The biggest challenge for such system is how to implement the functionalities equivalent to that of the traditional search engines under a fluctuating distributed environment. One of the functionalities is incremental crawl which requires recrawl each Web site according to the update frequency of each Web site’s content. However, recrawl intervals solely calculated from change frequency of the Web sites may mismatch the system’s real-time capacity which leads to inefficient utilization of resources. Based on our previous works on a DHT-based Web crawling system, in this paper, we propose two scale-adaptable recrawl strategies aiming to find solutions to the above issue. The methods proposed are evaluated through simulations based on real Web datasets and show satisfactory results. Keywords: Distributed Web crawling, search engine, incremental crawl, DHT.

1 Introduction Web search services are becoming more and more important in everyone’s daily life. Their availability and effectiveness largely depend on the efficiency of the underlying crawling systems. Nowadays, due to the huge evolution of the Web in the last 10 years, building a practical and effective search engine system has become an extremely complicated task involving both intelligence and funds which keeps smaller companies out of the door. Inspired by the concept of internet computing[1] and SETI@home[2] , large scale distributed Web crawling systems[3-5] (DWC systems for short) has been applied in practice. The basic concept of these systems is using the personal computing resources voluntarily contributed by the internet users to retrieve Web content from the internet itself avoiding the maintenance costs of huge computer clusters. The system developers either feedback part of the system’s commercial profit or provide distinctive services in order to attract the computer-contributor’s participation. In the work of [6] , we have proposed a DHT-based Web crawling system which solves the node churn and load balancing problems. Like some of the existing C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 91–105, 2010. © IFIP International Federation for Information Processing 2010

92

X. Xu et al.

researches on the Web crawlers published in the past[7-11] , we assumed the system is handling a periodic crawl[12] . Under this assumption, the crawling system visits the whole web until its collection reaches a desirable number of pages. However, this assumption is not quite practical for the commercial search engines which, instead, adopt the idea of incremental crawl[12] . The current main information publishers: news sites, BBS sites and blog sites are generating new Web content in a very high speed. Even the content on a fixed URL changes constantly due to the changes of the DOM structure and the increase of comments and replies[13] . Therefore, in order to keep its indexed data up-to-date, the search engine has to apply different recrawl intervals to different Web site, instead of periodically crawl all the Web sites. Our goal in this paper is to design a fully distributed Web crawling system with the ability to incrementally crawl the Web so that the stored content of all the discovered Web sites can be continuously updated. The existing incremental crawl strategies[12, 14-18] are adopting recrawl intervals according to the change frequency, relevance and information longevity of the Web pages. The intervals may range from minutes to days so that the most important sites (such as news sites which has high update rate) can be scanned frequently meanwhile the rests are not in order to save the networking, storage and computing costs. When calculating the recrawl intervals, the existing works are all under the assumption that the system has a fixed scale (most of the works are using single-machine system). Things are different in the context of internet computing. As the real-time scale of the system is difficult to derive, the recrawl intervals calculated by the existing strategies are very likely to mismatch the system’s current capacity which leads to two kinds of consequences: Requirement exceeds system’s capacity (REC). The recrawl tasks are submitted so frequently that the crawlers within the system cannot handle all of them. Many tasks are queued, timed out even removed due to the crawlers’ abnormal departure. The actual update rate is below the expected recrawl rate. Requirement underestimates system’s capacity (RUC). The system has enough capacity to execute all the recrawl tasks. But the capacity is not fully used, which prevents the system from achieving a higher update rate. In this paper, we firstly make a brief summary of our former works. Unlike the traditional dedicated crawlers, our crawler is designed to be run on contributed personal computers. We have to restrict the crawler from consuming too much bandwidth and local resources. A more detailed crawler implementation is presented including basic data structures involved in the recrawl. Secondly, based on the system design, we propose two scale-adaptable recrawl strategies. The concept of the two strategies is periodically using the system’s successful update frequency (indicating the system scale) to derive a new recrawl rate. The strategies can be used as the additional steps to the existing recrawl strategies which solely focus on the characteristics of the Web pages. The two methods are evaluated through a series of experiments simulating the two situations: REC and RUC. The results show our strategies’ effectiveness. The rest of the paper is organized as follows: Section 2 presents our system design. Section 3 proposes the self-adaptive update strategies. Section 4 provides the evaluations. Section 5 we conclude the paper.

Scale-Adaptable Recrawl Strategies for DHT-Based DWC System

93

2 System Design First, we briefly outline our existing works presented in [6] . The paper mainly describes the system architecture of a new DWC system. The system is proposed to solve two issues:1) Scalability which is crucial for a distributed system; 2) Download time which various due to the network localities.

Fig. 1. The architecture of the proposed DWC system. From left to right, the first figure shows the main system modules; the second figure shows the 3-tier mapping.

As shown in Fig.1, the system consists of a Manager node and a network-locationaware overlay containing all the crawlers. The task of the manager is to maintain the Web site (or channel which will be introduced later in this paper) database and submit recrawl tasks to the overlay. The network-location-aware overlay is achieved by adopting a 3-tier mapping. As is demonstrated in Fig.1, from bottom to top, the 3 tiers are physical tier, coordinate tier and DHT tier. On the coordinate tier, a network coordinate service (NC) maps the physical locations (measured using the network latencies) of the crawlers and the Web hosts to the coordinates in a multi-dimensional network coordinate space. Using the coordinates as IDs and keys, on the DHT tier, all the crawlers join a DHT overlay (currently implemented on CAN), while the Web hosts are inserted to the overlay. By combining the network coordinate space and DHT, we finally achieve a network-location-aware task scheduling method. Because both the DHT tier and coordinate tier are self-organized, the manager doesn’t need to choose to which crawler a task should be assigned, as well as monitor the state of the crawling tasks and the behavior of the crawlers. The Web pages downloaded are not stored on the crawler side. They are sent to a distributed storage system (the storage module) after a series of pre-processing. Because the detailed design of the manager and the crawler which is crucial for this paper is not presented in [6] , we describe them in the following literature. 2.1 Task Design A task is the description of a or a part of a Web host which we call a channel. A channel may represent 3 kinds of Web entries: 1) the whole Web host (such as “sports.sina.com.cn”); 2) a Web host’s sub-directory (such as “news.sina.com.cn/w/”);

94

X. Xu et al.

3) a set of sub-directories on a Web host. The manager maintains a database of channels. When a channel needs to be recrawled, the manager patch the channel into a recrawl task and submit the task the overlay which finally appoints a crawler to run the task. On receiving a recrawl task, the crawler simply run the task according to the task’s fields. A task contains the following fields: The list of seed URLs. The seed URLs are used as the starting point of the crawl. Mostly these URLs points to the index pages. The list also contains the URLs (pointing to this channel) that were discovered during the crawl of other channels (we call these URLs the inter-site URLs). The list of regular expressions to be used as the URL filter. The crawler only downloads the URLs which matches the regular expressions. In case there are many kinds of expressions, we use a list to include all of them. The number of Web pages derived according to historical record. The number is used to decide the size of URLseen data structure on the crawler. As all the downloaded data are transferred to it, the storage module knows the completion of each crawl task and feeds back the latest statistics of the channels to the manager. The feedbacks are also used later in Section 3. The maximum depth to crawl. Defines how the crawl process stops. The array of page content digest. The array contains the 32bit SHA-1 digests of all known pages which is also fed back by the storage module. The array is used to filter the pages that haven’t changed since the last crawl so that the crawler doesn’t have to extract their contents and send them back to the storage module. According to our experience, the array occupies at most 400kB. The digests of discovered inter-site URLs. On discovering inter-site URLs the crawler first checks if they are in this discovered set. If true, the URLs are ignored, else the URLs are transferred back to the storage module with the downloaded contents and finally fed back to the manager. 2.2 Crawler Design Different from the dedicated crawlers, our crawler is designed to be run on the machines contributed by the ordinary internet users who may want to do other things besides crawling. As a result, small data structures, restricted downloading scheme have to be applied. The whole program is run on a low system priority. As is shown in Fig.2, a crawler consists of 3 kinds of modules: task queue, task manager and task threads. Task queue. Crawl tasks submitted to the crawler are first put into the task queue by the task manager. Task thread. Each task thread is a complete crawler thread responsible to download a channel. As the target is only one channel, the size of each data structure involved is quite small compared with the dedicated crawlers. A task thread contains 4 main submodules: URLtodo, URLseen, Pageseen, Robotsfilter and downloader.

Scale-Adaptable Recrawl Strategies for DHT-Based DWC System

95

1) Downloader. The Downloader is responsible for downloading the Web pages. In order to limit the bandwidth usage of each task thread, the downloader opens one TCP persistent connection to the target Web host and doesn’t use pipelining. At the first stage of downloading, the downloader first downloads the Web site’s robots.txt and builds a robots filter. After downloading each Web page, the downloader extracts the URLs in it. All the extracted URLs are matched with the Robotsfilter and the URLseen. If the URLs can pass the two filters, they are inserted into the URLtodo. After downloading, the Web pages are filtered by Pageseen. Then the downloader write the textual content to a temporary disk file. 2) URLtodo. The URLtodo is a FIFO queue used to store the URLs to be downloaded. In order to avoid the queue becoming too long, we restrict the length of the queue to 1000. The URLs exceeds the limit is temporarily written into disk files. 3) URLseen. The URLseen is an array of integers storing the SHA-1 hash of the already discovered URLs in the current crawl process. 4) Pageseen. The Pageseen is the array of page content digest mentioned in 2.1. We use sequential search to find if a Web page is in the Pageseen. 5) Robotsfilter. The Robotsfilter is a list of forbidden paths on the Web host. When the downloading process reaches the maximum depth to crawl, the downloader stops, and transforms to an uploader. The uploader uploads the extracted Web contents (stored in the disk files by the downloader) and the statistics of the channel to the storage module. After that, the task thread terminates itself and release the space of all its data structures.

Fig. 2. The data structures on a single crawler (machine). A crawler consists of 3 main part: the task queue, the task manager and the task threads. Each task thread implements a crawler responsible to download a certain channel.

Task manager. Task manager is responsible for allocating the task threads. It ensures that at any time there are only a restricted number of task threads running in order to save the cost of system resources. When a task thread terminates, task manager fetches a new task from the task queue and allocate a new task thread to run it.

96

X. Xu et al.

3 The Scale-Adaptable Recrawl Strategies (SARSS) In our system, channels are classified according to their change frequency into several ranks we call the priority groups (PGPs). Channels with higher change frequency are assigned to PGP with higher rank. The average change frequency of the channels in each PGP is used as the PGP’s initial recrawl time interval. Therefore, channels with higher change frequency should have shorter recrawl intervals than the others. Here, we don’t focus on how the change frequency is derived as there’s previous works[12, 1418] concerning the issue. Instead, we focus on how the system scale affects the PGP’s initial recrawl time interval. As is discussed in Section 1, the recrawl intervals solely derived from the characteristics of the Web sites may mismatch the system’s capacity (the proof is presented as the simulation result in Section 4). As a result, we have to develop an algorithm to automatically adjust the recrawl intervals to the system’s scale. The initial time intervals are only used as the start point of the adjusting process. However, one cannot directly calculate the PGP’s recrawl intervals according to the scale of the system as there are no reliable real-time statistics on the system scale. Thus the adjustments have to be done according to the records of the manager. In this section, we propose two scale-adaptable recrawl strategies to automatically adjust all the channels’ recrawl intervals in order to maintain high update rate. In the following literature, an update of a channel indicates a successful recrawl in which the content of the channel stored in the storage module can be updated. Due to the node churn within the system, not all recrawls are successful. Therefore, the update rate does NOT equal to the recrawl rate. Instead it indicates the rate of successful recrawls. 3.1 Iterated Interval Reassignment (IIR) The basic idea of the IIR strategy is to periodically reassign each PGP a new update interval derived from the PGP’s latest update record. The reassignment process is called when 50% channels with the highest rank has completed more than 5 successful recrawls (updates) since the last reassignment. The process is run in an isolated thread so that it doesn’t stop the system’s crawl sequence. First, we add 3 records to each channel. 1) historyInterval (hInterval): the average time gap between updates since the last reassignment; 2) historyRecrawlCount (hSubCount): records the number of recrawl tasks (on that channel) submitted to the crawlers since the last reassignment; 3) historySuccessCount (hSucCount): records the number of updates since the last reassignment. Then, each PGP is reassigned a new recrawl interval which is the mean of all channel’s historyIntervals. To this point, we only get intervals from the historical records. In order to achieve a higher update rate, we add a encourage mechanism to the above process. We define the successRate of a PGP as quotient of (hSucCount / hSubCount) and add a static variable lastSuccessRate to record the successRate calculated during the last reassignment. If the successRate is larger than lastSuccessRate, the system would make a positive attempt by decreasing the update interval by a pre-configured parameter δ (<1). The final step is to check the value of each PGP’s new recrawl interval to ensure that the PGPs with lower ranks never have smaller recrawl intervals than the PGPs with higher ranks, at the same time, to prevent the recrawl intervals from growing too large. The above process is written as psuedocode below.

Scale-Adaptable Recrawl Strategies for DHT-Based DWC System

97

The Psuedocode of IIR. The code implements the main function InheritedIntervalReassignment of the IIR. We assume K PGPs, and In PGP[K][], the first dimension indicates the priority number 0..K-1, the second dimension stores the list of channels under a certain priority number const EncourageRate = ¥; static LastSuccessRate[K] = {0, 0, …, 0}; program InheritedIntervalReassignment ( PGP[K][] ) var P:-1..K-1;SuccessRate[K];ActualExecInterval[];NewIntervals[K]:Real; RecrawlCount, SuccessCount, C: Integer; begin P := -1; repeat P:= P + 1; C := -1; repeat C := C + 1; ActualExecInterval[C] := PGP[P][C]. hInterval(); RecrawlCount := RecrawlCount + PGP[P][C].hSubCount; SuccessCount := SuccessCount + PGP[P][C].hSucCount; until C = number of channels in PGP[P] NewIntervals[P] := mean(ActualExecInterval[0..C-1]); SuccessRate[P] := SuccessCount / RecrawlCount; if SuccessRate[P] > LastSuccessRate[P]; then { encourage }; NewIntervals[P] :-= NewIntervals[P]*EncourageRate; end if LastSuccessRate[P] = SuccessRate[P]; until P = K-1; call checkPriorities(NewIntervals[]); Set NewIntervals[] as manager’s recrawl interval; end.

The Psuedocode of checkPriorities. The code is the implementation of function checkPriorities which is called by InheritedIntervalReassignment. const SmallestMultiple = ³; const BiggestMultiple = ; program CheckPriorities ( NewIntervals[K] ) var upperLimit: Real; P: Integer; begin P := 0; repeat P := P + 1; if NewIntervals[P] < NewIntervals[P-1]; then NewIntervals[P]:=NewIntervals[P-1]*SmallestMultiple; end if upperLimit := NewIntervals[P-1] * BiggestMultiple; if NewIntervals[P] > upperLimit; then NewIntervals[P] = upperLimit; end if until P = K-1; end.

3.2 Iterated Capacity Reassignment (ICR) The basic idea of the ICR strategy is to periodically evaluate the system’s capacity and distribute the capacity to each PGP. In this method, the recrawl interval of each PGP shares a common divisor ΔT0PGP which is the recrawl interval of the PGP with the highest rank (rank 0) and ωi which is the multiple between the initial recrawl interval

98

X. Xu et al.

of the ith and the (i+1)th PGP. Then, the recrawl interval of the PGP with rank k k −1

equals to ΔTk

PGP

=

∏ω × ΔT

PGP 0

j

. ΔT0PGP is set to Δt at the system’s initial stage. We de-

j =0

fine ΔT0PGP = θ × Δt . So in each ICR process our goal is to derive a new θ . In the following literature, we present the derivation of how to calculate θ during each assignment process. All the records used in 3.1 are inherited. We assume that there are N channels in the manager’s database; the channels are classified into S ranks (PGPs); the jth channel has an update interval ΔT j ; the channel’s update rate is λ j = 1 ΔT j ; the channel’s average download time is tl . Here we add a record to each channel to reserve its download times during the updates. According to [6] , our scheduling algorithm guarantees the channel’s recrawl task is always assigned to a crawler with low latency. Therefore, unless some exceptional event such as congestion occurs, there won’t be significant difference between the download times. We also assume that during ΔT j , there are on average M crawlers in the system; the number of task threads on the ith crawler equals to CLi . If we want to update each channel j on schedule, the best choice is to complete all the recrawl tasks submitted during ΔT j . Under the assumption that the system’s DHT-based scheduling can sufficiently balance the load of all crawlers, we have the following formula.

∑(λ × t ) × ΔT × N

l

l

j

l =1

1

∑

≤ ΔT j

M

(1)

CLk

k =1

We replace “ ≤ ” with “=”, then (1) can be transformed to:

∑(λ × t ) = ∑CL N

M

l

l

l =1

(2)

k

k =1

The right side of (2) is not an accurate value as the set of crawlers subjects to constant change. But, it can be treated as an approximation of the latest system capacity. To implement an iteration process, we replace the right side of (2) with a history-related value derived from the record of each channel’s previous L updates. 1 × θ × Δt

S −1

∑( k =0

∏ω

−1

∑t ) = history(∑ (λ × t )), ∏ω N

1 k −1

l

l

l

l =1

PGP ( k )

j

=1

j

=1

j =0

(3)

j

j =0

S −1

θ=

∑( k =0

1 k −1

∏

∑t ) l

ωj

PGP ( k )

−1

∑ (λ × t ))) , ∏ω N

(Δt × history(

l

l =1

l

j =0

(4)

j =0

Then θ can be calculated according to (4). In the implementation, we also add the encourage mechanism and checkPriority function at the end of the process.

Scale-Adaptable Recrawl Strategies for DHT-Based DWC System

99

4 Experiments

20%

100%

15%

75% Crawlers %

Percentage of Host Pairs

All the experiments are done under complicated simulations. First, we adopt P2Psim’s King dataset[20] to simulate the network latencies between the crawlers and the Web hosts. Because the size of the dataset (only 1740 nodes) is relatively small, we map the nodes into a coordinate space using Vivaldi[21] and replicate a set of new nodes around each original node. The distance between the original nodes and its replica is a random value scaling from 10% to half of the distance between the original node and its nearest neighbor. Accordingly, we generate two larger datasets: 1) 16000 Web hosts and 1400 crawlers. The original 1740 nodes are divided into 1600 and 140 and we replicate 10 new nodes around each node. 2) 16000 Web hosts and 2800 crawlers. Replicating 1 more node around each of the 1400 crawler nodes.

10%

5%

50%

25%

0%

0%

0

50

100

150

200

250

300

350

400

512Kb/s

1Mb/s

2Mb/s

Bandwidth

20%

50.0%

15%

37.5% Channels %

Percentage of Web hosts

Network Latency (ms)

10%

5%

25.0%

12.5%

0%

0.0%

0k

10k

20k Number of Pages

30k

40k

0

1

2

3

4

Priority Number

Fig. 3. Statistics on the datasets. left to right, top to bottom, the figures show the distribution of: network latencies, bandwidth, page numbers and number of channels on each PGP.

Second, the bandwidth assigned to each crawler has 3 levels: 512Kb/s, 1Mb/s and 2Mb/s. The number of crawlers under each level is decided according to standard normal distribution 1.5*N(1, 0.5). Each Web host is treated as a channel. The number of Web pages on each channel is assigned according to a Web dataset we collected in 2008. The dataset is the result of 1738 crawling tests containing 15,187,511 URLs. Because the dataset only contains 1738 Web hosts, we also have to increase its size using replication. We divide the set of Web hosts (channels) into 5 PGPs. The number of channels under each rank is decided by 1.25*N(2, 1). To each channel in the ith PGP, we assume a change frequency obeying Poisson distribution with λ = 1 θ i , i = 0,1,2,3,4 . θ i indicates the average time interval when most of the Web pages have changed on the channel. For the 5 PGPs, we assume θ 0 = 60 min , θ1 = 120 min , θ2 = 240 min , θ3 = 480 min , θ 4 = 960 min . All the above setups are illustrated in Fig.3. The simulation is based on time steps. Each step represents 1 minute. The whole simulation involves 90000 steps (1500 hours, 62.5 days). The crawlers are all added to the system at the first step. The living time of each crawler obeys the normal distribution

100

X. Xu et al.

N(1000, 200) which means 95% crawlers’ living time is between 608 steps and 1392 steps. After the living time, 1/3 crawlers leave the system and notify the leaving event to their neighbors; the tasks both running and waiting are migrated to the neighbors; the rest 2/3 crawlers leave the system without notifying. The tasks on these crawlers are completely lost. Each dead crawler will rejoin the overlay within a time gap obeying the normal distribution N(600, 200). The maximum number of task threads on each crawler is set to 5. The maximum length of task queue is set to 5. When the task queue is full the newly submitted task is forwarded to a less loaded crawler according to [6] . If a task’s waiting time exceeds the time limit which equals to the channel’s recrawl interval, it is canceled by the crawler. The download time of each Web page is calculated by adding the RTTs, data transfer time and a wait time (fixed to 200ms) involved. The RTT equals to 2 times the network latency in order to simulate the time cost of a GET request under HTTP persistent connection without pipelining. The data transfer time is calculated through dividing the size of the Web page by the speed. The speed equals to the bandwidth of the crawler divided by the number of task threads. Because the 1-minute-per-step setup shields us from directly simulating each downloading process. Therefore, instead, the total download time of all pages on a Web host is pre-calculated once it is submitted to a certain crawler. 4.1 Evaluation Criterias In the simulation, 3 recrawl strategies are compared. They are 1) FIXED: the recrawl interval is fixed to the change frequency of the Web site, ignoring the fluctuation of the system’s capacity. 2) IIR. 3) ICR. We use the following evaluation criterias: The total number of recrawl tasks. The number indicates the load of the manager. Under both REC and RUC, the value should be minimized in order to reduce the manager’s unnecessary load. The number of queued tasks. The number indicates the number of tasks contained in the crawlers’ task queue. Under both REC and RUC, it should be minimized. The system’s throughput. The value indicates the total download rate of the system. Under both REC and RUC, the value should be maximized so that all the capable crawlers are efficiently utilized. Update quality. To each channel, its change rate (the reciprocal of the channel’s change frequency, and the FIXED strategy is using this change rate as the channel’s recrawl ected . In the real run, due to the node churn, the actual update rate rate) is labeled as λexp i actual ected ) is always lower than λexp . Assuming that each PGP i has ( λi i a weight Ri , then the update quality calculated by Formula(5) indicates how well the recrawl strategy can perform against the channels’ actual change frequency. Here we apply weights (used in Table 2 and 4 ) to PGPs (4~0): 2 4 = 16 , 23 = 8 , 2 2 = 4 , 21 = 2 , 20 = 1 . Under both REC and RUC, the update quality should be maximized.

∑ (R × λλ N

updateQuality =

i

i =1

actual i exp ected i

) N

(5)

Scale-Adaptable Recrawl Strategies for DHT-Based DWC System

101

In addition, there are other metrics such as the number of task migrations, load variations, etc. Due to the size of the paper, we don’t present them. 4.2 The Case of REC We firstly examine the performance of IIR and ICR under the case of REC. Here we choose the dataset containing 16000 channels and 1400 crawlers. Table 1. Number of recrawl tasks and average recrawl intervals (steps) under REC. The left 4 colums show the number of recrawls. The rest 4 colums show the average re-recrawl intervals. Rank 4 3 2 1 0 Total

FIXED 0.905E6 3.295E6 2.979E6 0.795E6 0.056E6 8.030E6

IIR 0.251E6 1.212E6 1.512E6 0.471E6 0.038E6 3.485E6

ICR 0.522E6 1.719E6 1.499E6 0.393E6 0.027E6 4.160E6

Rank 4 3 2 1 0 Mean

FIXED 60 120 240 480 960 60

IIR 216 326 473 811 1412 216

ICR 104 230 477 972 1964 104

The left side of Table 1 shows that IIR and ICR dramatically reduce the number of recrawl tasks. To explain the reason, the right side of Table 1 shows the recrawl intervals. As the system’s capacity cannot match the channels’ change frequency (used as the recrawl frequency of the FIXED strategy), the IIR and ICR all increase the recrawl intervals of each rank. From the table, we observe that the IIR narrows the gaps between different ranks; meanwhile, the ICR widens them. However, the above characteristics don’t cause continuous narrowing or widening. We observe that during the whole simulation process, the recrawl intervals don’t change in linear patterns. From Table 1, we observe that there are much more tasks submitted under FIXED strategy. However, such high recrawl rate doesn’t lead to high efficiency. Under FIXED strategy approximately 70% tasks are waiting in the task queue. According to Fig.4, if we assume that the average number of online crawlers is 700, then the length of each crawler’s task queue under FIXED strategy is about 17. In such situation, most queued tasks are timed out and cancelled. On the other hand, the lengths under IIR and ICR are about 2.8 and 1.4 which indicate 83% and 92% reduction on the task queue’s memory cost. IIR has less queued tasks because IIR’s recrawl intervals on rank 0 to rank 2 are larger than that of the ICR’s. The throughput in Fig.4 is calculated through dividing the total size of pages downloaded in 10000 steps by the total time cost (including RTTs, data transfer time and wait time). Because the task threads are not always downloading pages, the throughputs don’t equals to the number of crawlers multiplied with the bandwidth. Instead, they only equals to half of the product. In addition, the curves all present obvious decline in the first 20000 steps. This is due to the decline of the number of crawlers, since we added all the crawlers to the system at the first step. Moreover, although the number of tasks is significantly reduced under IIR and ICR, the throughput of the system doesn’t change dramatically. The throughput under the ICR almost matches that under FIXED strategy. On the other hand, as IIR’s recrawl intervals on rank 4 to rank 2 are larger, the throughput under IIR declines about 14%.

102

X. Xu et al.

FIXED IIR ICR

16000 14000

Queued Tasks

12000 10000 8000 6000 4000 2000 0 0

10000

20000

30000

40000

50000

60000

70000

80000

90000

Steps FIXED IIR ICR

1000 900

Throughput (Mb/s)

800 700 600 500 400 300 200 100 0 0

10000

20000

30000

40000

50000

60000

70000

80000

90000

Steps

Fig. 4. The system performance throughout the simulation under REC. From left to right, the first figure shows the number of queued tasks; the second figure shows the system’s throughput. Table 2. Update quality under REC. The left three columns labeled “w” use the weights assumed in 4.1; the right three columns use 1 as each PGP’s weight. Rank FIXED (w) IIR (w) Update Quality 1.9987 2.0266

ICR (w) 2.0176

FIXED 0.5571

IIR 0.4936

ICR 0.4331

Table 2 shows the update quality of the fixed recrawl strategy, IIR and ICR. As the FIXED strategy represents a very radical approach, its update quality can be treated as the maximum capacity the system can achieve under REC. IIR and ICR’s weighted update qualities are very similar to that of FIXED, which means the update quality has been maximized. Meanwhile, their non-weighted update qualities are lower. We find that the reason is that, as the system’s capacity cannot meet the required recrawl rate, both IIR and ICR decrease the recrawl rates of the lower ranked channels (which accordingly decreases the update rates) in order to ensure high update rate of the higher ranked channels. Since the non-weighted update qualities are calculated without concerning the channels’ importance, they become inevitably smaller. We also notice that even under IIR and ICR which has lower recrawl rate, the average success rate of the tasks is only 30% (the fact is also true under the case of RUC). The phenomenon is mainly caused by the crawler’s non-notified departure in which all the tasks assigned to the crawler are completely lost. We consider this low rate a necessary cost under a fluctuating distributed environment. On the other hand, under FIXED strategy, since there are a lot more tasks timed out in the queue, the success rate become even worse and is less than 15%.

Scale-Adaptable Recrawl Strategies for DHT-Based DWC System

103

4.3 The Case of RUC We secondly examine the performance of IIR and ICR under the case of REC. Here we choose the dataset containing 16000 channels and 2800 crawlers. Table 3. Number of recrawl tasks and average recrawl intervals (steps) under RUC. The left 4 colums show the number of recrawls. The rest 4 colums show the average recrawl intervals. Rank 4 3 2 1 0 Total

FIXED 0.905E6 3.295E6 2.979E6 0.795E6 0.056E6 8.030E6

IIR 0.738E6 3.506E6 3.388E6 0.943E6 0.075E6 8.650E6

ICR 0.975E6 3.428E6 3.805E6 0.691E6 0.052E6 8.953E6

Rank 4 3 2 1 0 Mean

FIXED 60 120 240 480 960 287

IIR 74 110 214 399 782 246

ICR 56 112 190 544 1119 284

Different from the case of REC, the number of recrawl tasks under IIR and ICR in Table 3 don’t decline. Instead, they are increased by 8% and 11.5% compared with the FIXED strategy. The fact indicates that, under the case of RUC, in order to fully utilize the system’s capacity, there’s no way to reduce the load of manager. The bottleneck should be solved by either upgrading the manager’s capacity or deploy a number of managers to distribute the loads. We also found that IIR and ICR perform differently on deciding the recrawl intervals. On one hand, the IIR increases the intervals of higher-ranked (rank 4) channels and decrease that of lower-ranked (rank 3 to 0) channels. On the other hand, ICR decreases the intervals of higher-ranked (rank 4 to 2) channels and increase that of lower-ranked (rank 1 to 0) channels. Through the comparison we conclude that ICR is more likely to ensure the high re-scrawl rate of high-ranked channels, while IIR performs better if taking all ranks into consideration. Both Fig.5 shows the increase in the number of queued tasks and system throughput under IIR and ICR. To our surprise, the increase in Fig.5 (300%-400% in the left figure and 200% in the right figure) significantly exceeds the increase (8%-11.5%) in Table 3. On one hand, we consider the increase in queued tasks no harm to the crawlers. If we assume that the average number of online crawlers is 1400 (half of the 2800 crawlers), then the length of each crawler’s task queue under FIXED, IIR and ICR is about 0.36, 2 and 3. The numbers are low enough to prevent the time out. On the other hand, by re-adjusting the recrawl intervals of each rank, the system’s download capacity is fully used. Under IIR and ICR, the system’s throughput reaches 2 times of the throughput achieved under the case of REC (which also indicates the growth in the number of crawlers). But under the FIXED strategy, the system’s throughput doesn’t change compared with REC. Table 4 shows that IIR and ICR significantly increase the weighted update quality by 91% and 72.6% compared with FIXED strategy, which is satisfactory. We further notice that the non-weighted update qualities of IIR and ICR exceeds 1, which means the update rate exceeds the channels’ change frequency. Nevertheless, we believe the freshness of the Web contents can still be improved since the Web pages on the channels don’t change all at once (instead, their change frequency is supposed to obey the Poisson distribution[12] ). Another important issue is that if a crawler visits a Web host too often, it brings additional load to the Web host. A feasible solution is to add

104

X. Xu et al.

Queued Tasks

an upper limit to the recrawl rate (e.g. 2 times of the channel’s change frequency) and, if the system’s capacity permits, submit more channels to crawl. FIXED IIR ICR

16000 15000 14000 13000 12000 11000 10000 9000 8000 7000 6000 5000 4000 3000 2000 1000 0 -1000 0

10000

20000

30000

40000

50000

60000

70000

80000

90000

Steps FIXED IIR ICR

1000 900

Throughput (Mb/s)

800 700 600 500 400 300 200 100 0 0

10000

20000

30000

40000

50000

60000

70000

80000

90000

Steps

Fig. 5. The system performance throughout the simulation under RUC. From left to right, the first figure shows the number of queued tasks; the second figure shows the system’s throughput. Table 4. Update quality under RUC. The left three columns labeled “weighted” use the weights assumed in 4.1; the right three columns use 1 as each PGP’s weight. Rank FIXED (w) IIR (w) Update Quality 4.7873 9.1474

ICR (w) 8.2611

FIXED 1.0013

IIR 1.9827

ICR 1.9701

5 Conclusions In this paper, we propose the design of a distributed incremental crawling system. The system is designed to utilize the machines contributed by ordinary internet users as the crawlers. Based on this design, we propose two new recrawl strategies to adapt the recrawl frequency to the system’s capacity in order to make full use of the contributed machines. The strategies show their efficiency under a series of simulations. Acknowledgments. This paper was partially supported by the National Natural Science Foundation of China under Grant No.60703014; the National Grand Fundamental Research 973 Program of China under Grant No.G2005CB321806; the Specialized Research Fund for the Doctoral Program of Higher Education, SRFDP No. 20070213044; the National High-Tech Research and Development Plan of China under Grant Nos. 2007AA01Z442; the National High-Tech Research and Development Plan of China under Grant Nos. 2009AA01Z437.

Scale-Adaptable Recrawl Strategies for DHT-Based DWC System

105

References 1. Foster, I.: Internet Computing and the Emerging Grid. Nature (2000) 2. Werthimer, D., Cobb, J., Lebofsky, M., Anderson, D., Korpela, E.: SETI@HOME— Massively Distributed Computing for SETI. Comput. Sci. Eng. 3, 78–83 (2001) 3. YaCy Distributed Web Search, http://yacy.net 4. FAROO Real Time Search, http://www.faroo.com 5. Majesti-12: Distributed Web Search, http://www.majestic12.co.uk 6. Xu, X., Zhang, W.Z., Zhang, H.L., Fang, B.X., Liu, X.R.: A Forwarding-based Task Scheduling Algorithm for Distributed Web Crawling over DHTs. In: The 15th International Conference on Parallel and Distributed Systems (ICPADS 2009), pp. 854–859. IEEE Computer Society, Shenzhen (2009) 7. Heydon, A., Najork, M.: Mercator: A Scalable, Extensible Web Crawler. World Wide Web 2, 219–229 (1999) 8. Boldi, P., Codenotti, B., Santini, M., Vigna, S.: Ubicrawler: A Scalable Fully Distributed Web Crawler. Software—Practice & Experience 3(8), 711–726 (2004) 9. Loo, B.T., Cooper, O., Krishnamurthy, S.: Distributed Web Crawling over DHTs. Technical report, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley (February 1, 2004) 10. Singh, A., Srivatsa, M., Liu, L., Miller, T.: Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web. In: The SIGIR Workshop on Distributed Information Retrieval, pp. 126–142 (2004) 11. Lee, H.T., Leonard, D., Wang, X., Loguinov, D.: IRLbot: Scaling to 6 Billion Pages and Beyond. In: The 17th International Conference on World Wide Web, pp. 427–436 (2008) 12. Cho, J., Garcia-Molina, H.: The Evolution of the Web and Implications for an Incremental Crawler. In: The 26th International Conference on Very Large Data Bases (VLDB 2000), San Francisco, pp. 200–209 (2000) 13. Adar, E., Teevan, J., Dumais, S.T., Elsas, J.L.: The Web Changes Everything: Understanding the Dynamics of Web Content. In: The 2nd ACM International Conference on Web Search and Data Mining (WSDM 2009), pp. 282–291. ACM, Barcelona (2009) 14. Edwards, J., McCurley, K., Tomlin, J.: An Adaptive Model for Optimizing Performance of an Incremental Web Crawler. In: The 10th International Conference on World Wide Web, pp. 106–113. ACM, Hong Kong (2001) 15. Wolf, J.L., Squillante, M.S., Yu, P.S., Sethuraman, J., Ozsen, L.: Optimal Crawling Strategies for Web Search Engines. In: The 11th International Conference on World Wide Web, pp. 136–147. ACM, Honolulu (2002) 16. Cho, J., Molina, H.G.: Effective Page Refresh Policies for Web Crawlers. ACM Trans. Database Syst. 28, 390–426 (2003) 17. Pandey, S., Olston, C.: User-centric Web Crawling. In: The 10th International Conference on World Wide Web, pp. 401–411. ACM, Chiba (2005) 18. Olston, C., Pandey, S.: Recrawl Scheduling based on Information Longevity. In: The 17th International Conference on World Wide Web, pp. 437–446. ACM, Beijing (2008) 19. Brin, S., Page, L.: The Anatomy of A Large-scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems 30, 107–117 (1998) 20. P2PSim-Kingdata, http://pdos.csail.mit.edu/p2psim/kingdata/ 21. Dabek, F., Cox, R., Kaashoek, F., Morris, R.: Vivaldi: A Decentralized Network Coordinate System. In: SIGCOMM 2004, pp. 15–26. ACM, Portland (2004)

Power Efficient Scheduling for Hard Real-Time Systems on a Multiprocessor Platform Peter J. Nistler and Jean-Luc Gaudiot University of California, Irvine Irvine, CA, USA [email protected], [email protected]

Abstract. An online, real-time scheduler is proposed to minimize the power consumption of a task set during execution on a multiprocessor platform. The scheduler is capable of handling the spectrum of task types (periodic, sporadic, and aperiodic) as well as supporting mutually exclusive, shared resources. The solution presented is a user adjustable scheduler which ranges from producing an optimal schedule which requires the minimum power during the worst case execution scenario to producing a suboptimal schedule which aggressively minimizes power during the typical execution scenario. Keywords: Real-Time, Scheduler, Power Management, Multiprocessor.

1 Introduction Multi-core systems are designed to bring the processor power consumption and heat density down to a manageable level without sacrificing performance. This claim is based on the dynamic power characterization of a CMOS device: (1), where P is power, C is the device capacitance, V is the voltage, and f is the frequency. At the same time, we are entering a world of ubiquitous computing.: there are microprocessors in watches, phones, televisions, simple kitchen appliances, cars, and power grids. Many of these systems need to operate within certain time bounds to ensure proper function. Embedded systems are often designed within a prescribed power envelope. This power constraint might be defined by thermal boundaries or by energy constraints, such as in a battery or solar cell-operated environment.. Performance and power requirements must be carefully weighed against each other to ensure that the system functions as desired. With the current trend towards multiprocessors and the need to minimize power, there is a demand for a hard real-time multiprocessor scheduler that optimizes for minimal power usage.

2 Prior Art Much work has dealt with creating power efficient, real-time schedulers, with one of two goals: minimize the power for the worst case scenario. or for the typical scenario.. The first type of scheduler, (e.g., Chen et al. [1]) assumes that the task will always C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 106–120, 2010. © IFIP International Federation for Information Processing 2010

Power Efficient Scheduling for Hard Real-Time Systems

107

require the maximum possible execution time possible. Under this assumption, the scheduler scales back the operating frequency of the processor so that the task completes exactly at its deadline. This is an optimal solution for scheduling a given task set on a multiprocessor while providing hard real-time guarantees. The second type of scheduler, (e.g., Gruian [2] and Malani et al. [3]), assumes that the task will execute in an average time and sets out to minimize the power. The scheduler sets the operating frequency such that the task on average will finish execution by a set time . If the task does not finish executing by then the scheduler increases the operating frequency such that the task can finish the task execution by . Fig. 1 depicts the differences in the speed scaling schemes.

Fig. 1. Scheduling Strategies

Finally, Chen and Hsueh [4] introduced a novel framework for executing periodic real-time tasks on uniform multiprocessors. They developed an optimal, simple, clear, and easily visualized scheduler.. Their framework is a core component of this work. The Precaution Cut Greedy (PCG) scheduling algorithm is fluid in that each task executes at a constant rate during its executable period. It is also P-fair in that the work allocation and work executed are within one quantum at all times.The basic quantum employed by the PCG algorithm is a novel construct called a Time and Local Execution Requirement (T-Ler) plane. The T-Ler plane can be visualized as a right isosceles triangle where the y-axis is the system execution requirement (work) and the x-axis is time. The plane is created by overlaying each task’s fluid schedule inside the triangle and aligning the time boundaries with scheduling events. By assuming the period of every task is equal to its deadline, the only scheduling events in Chen and Hsueh’s model are the task periods. The entire system schedule is broken into a series of T-Ler planes. Inside each plane, the tasks can be executed in any fashion, provided that all the required work is completed by the end of the plane. The PCG algorithm provides an optimal solution for scheduling on uniform multiprocessors, through the application of the fluid scheduling model.

3 Problem Definition The existing power efficient real-time schedulers are not well suited for use in their current form. If a scheduler can only handle task sets where all the tasks share one common deadline; if it can only handle periodic tasks; if it cannot deal with even the

108

P.J. Nistler and J.-L. Gaudiot

simplest of locks or other shared resources; then the scheduler has limited applicability to actual implementations. The focus of this work is to produce that well-rounded scheduler, a scheduler designed with an actual system in mind. The first part of this work establishes the hardware and power model used in the system. The second part describes the basic scheduler and supporting framework that minimizes the power consumption on a multiprocessor. This approach is strictly for hard real-time systems. The system will contain support for any mix of periodic, sporadic, and aperiodic tasks. The support for these non-periodic tasks requires that the scheduler and framework are online algorithms. The scheduler shall also maintain a strict ordering of accesses to shared resources such as memory mapped IO structures and locks. The result is an online scheduler and supporting framework that minimizes the power consumption of a multiprocessor while scheduling hard real-time tasks. The third part of this work explains how to use the scheduler and supporting framework to support the two different approaches to scheduling. The first approach describes a high-reliability system. The system is simple, clean, and guarantees an optimal solution for the worst case execution scenario. It is a combination and further extension of the work done by [1] and [4]. The second approach describes a more commercial solution, optimized for the typical execution scenario. It combines the high-reliability approach with the idea of stochastic scheduling. The work is broken up into three different sections. Section 4 discusses what the power optimal configuration of the hardware looks like. Section 5 extends Chen and Hsueh’s scheduling framework for supporting periodic, sporadic, and aperiodic tasks while supporting shared resources. Section 6 describes how to tailor the scheduler so that it can proactively reduce the system power.

4 Power Optimal Hardware Configuration This section addresses the problem of selecting processor operating frequencies so that the system can support the requirements of a given task set. The power optimal hardware configuration and a simple algorithm for determining it are provided. 4.1 The Processor / Task Relationship A relationship needs to be established between the processor capabilities and the task execution requirements. First a multiprocessor model is selected to establish a foundation for the work. Based on this model’s relationship between the software and hardware some bounds can be established to find the minimum operating frequencies required to support the task set. There are two different high-level multiprocessor models that need to be explained before the real work can begin. The first model is identical parallel machines. Each of the processors in the system is identical to all the other processors and operates at the same frequency. Thus all the processors in the system perform identically, allowing tasks to freely migrate without affecting the task’s execution. This model is the easiest to support at the operating system and scheduler level.

Power Efficient Scheduling for Hard Real-Time Systems

109

The second model is uniform parallel machines. Similar to the identical parallel machines the tasks can freely migrate as an identical instruction set used on all the processors. The difference is that the processing capabilities of the processors are / where is the only linearly related to each other. The linear relation is amount of work executed and is the time elapsed. Each processor in this system is allowed to take on any positive value for . Regardless of the scheduler’s goals it must be aware of how the processors compare to each other to make a good decision. The multiprocessor model being used in this research is a special case of the uniform parallel machine model. The processors are physically implemented as identical parallel machines. However, each of the processors can independently operate at a unique clock frequency. There are several reasons for taking this approach. The first reason is that the software designer only needs to optimize the code to execute on a single processor microarchitecture. The second reason is that this approach relates processing capability and operating frequency to a simple, linear formula. If the processing capability is normalized to the maximum operating frequency, then it follows that . The final reason is that running separate clock frequencies to each core allows a fine-tuning of the power consumption. As described earlier the amount of work (machine instructions) that a processor can complete in time t is defined by: . The tasks being scheduled are de, , where task has a worst case execution time at scribed by the triplet: the maximum frequency , a relative deadline , and a period . Note that is a measure of time, rather than a measure of work. It follows that the work required to execute must be where is the processing capability of a processor at its maximum frequency. Recall with that 1, the relationship simplifies to . The work of a task equals its worst case execution time. A common metric for determining schedule feasibility in real-time systems is task utilization of a processor . The utilization on a fixed frequency, uniprocessor is described as / . This equation needs a slight update in order for it to work on a variable frequency system; the actual execution time must meet the equality . When the utilization is combined with the equation for it follows that . This is to say that a processor must have a processing capability in order for the task to finish by its deadline. Further, it becomes apparent that the minimum operating frequency of the processor executing is simply . This equation is a lower bound on the selectable frequencies that can be used while executing a given task. This utilization model can be extended to the system level, where it can be seen that the total utilization of the system is characterized by 1

(2)

where is the number of processors and is the number of tasks in the system. As a quick check for task set feasibility, note that if, at any point in time either 1 or 1 then the task set can never be successfully scheduled on the system.

110

P.J. Nistler and J.-L. Gaudiot

4.2 The Ideal Processor Frequencies This section shall investigate the optimal processor speeds for a given workload, without regard for the real-time constraints. The only goal of this section is to determine the optimal operating frequency of each processor such that a given amount of work can be accomplished in a set amount of time. As mentioned earlier, the power of a system is quadratic relative to the operating voltage of the device and linear in terms of . Furthermore, the relationship between the required operating voltage and the frequency on dynamic voltage scaling hardware is assumed to be a linear relationship ( ). Since is linear with the result is that when frequency and voltage are scaled together. The power model can be easily extended to support multiprocessor systems. By summing the power of each individual processor, the system’s power is modeled by (3) This is not simply the power of one processor multiplied by , because each of the and processors can take on a unique frequency. From the earlier equality equation (2), it can be seen that a system must maintain a total processing capability Σ to ensure that all the tasks in a given span will complete by their deadlines. If , then the system is capable of processing more work than the task set demands. On the other hand, if then the system cannot process all the task set demands, which means that the task set is infeasible by the system. Theorem 1. The system power to achieve a total processing capability when , .

is minimized

Case 1: 1 This case describes a uniprocessor system. All the work must be executed on the one processor. Case 2: 2 In this case the total processing capacity er characterization of

, which leads to the system pow-

. 1 holds This system functions correctly as long as the original assumption 0 true. The result is that the minimum power is achieved when the work is evenly distributed across the processors. 2 Case 3: 2 The optimal solution for this case can be established with a proof by contradiction. Assume that there is a system where the optimal power is achieved by unevenly distributing the work among all the processors. This assumption can be expressed by the formula

Power Efficient Scheduling for Hard Real-Time Systems

111

1 . Where 1 1 The first equation states that there exists a state where one processor deviates from the average execution speed / by while other processors adjust their frequency to ensure that . The deviation must be bounded to ensure that none of the processors are set to a negative frequency. Solving the first equation for gives the result 3 . 1 When this equation is evaluated across the bounds of it follows that 3 / . This result violates the bound on ; one of the processors must run at a negative frequency in order for the first equation to hold true. This is nonsense. Therefore by contradiction, the minimum system power is achieved at ,

1

.

(4)

The result is that the hardware consumes the least power when the processors all operate at . This is to say the optimal solution from the hardware’s perspective is to adjust each processor such that . Therefore, the power optimal scheduler must try to produce a schedule where all the tasks can be feasibly scheduled with this average operating frequency. 4.3 Power Optimal Software Configuration This section shall address the problem of selecting processor operating frequencies such that the system consumes the minimum power while maintaining task set feasibility. The system must find a solution that is as close to the hardware optimal solution as possible while producing a feasible configuration. This section will establish the requirements for the scheduling algorithm. The job of selecting optimal processor frequencies can be reduced to arranging all the task utilizations into bins such that the total value of each bin is as close to as possible. It is important to note that tasks in the system are assumed to be nondivisible. One task cannot be split into parallel threads and simultaneously executed on multiple processors to shorten the execution time. The task is only allowed to execute on one processor at a time. However the task is allowed to migrate between any processors without penalty. The feasibility constraint for uniform multiprocessors has already been described by Funk et al. [5]; their result follows. Theorem 2. Feasibility Condition on Uniform Multiprocessors by Funk et al. [5] Consider a set , ,…, of tasks indexed with non-increasing utilization for all ). Let Σ for all . Let denote a system of (i.e., where for all . uniform processors with processing capabilities , , … , Σ for all . Task set can be scheduled to meet all deadlines on uniform Let multiprocessor platform if and only if the following constraints hold: and , for all 1, … , .

112

P.J. Nistler and J.-L. Gaudiot

The speed scaling algorithm needs to select processor frequencies as close to as possible while meeting the constraints of Theorem 2. The first and easiest constraint is to ensure that only processors are provided to the scheduler. This is accomplished by idling or powering down any excess processors. Next the algorithm must ensure that the list of tasks is sorted by . If the most demanding task in the set is less than or equal to , then it is possible to schedule all the tasks when the active processors operate at . The solution for this case meets the constraints of Theorem 2 and is optimal from a power perspective. When the value is greater than a slightly different approach needs to be taken. In order to ensure that Theorem 2 is upheld set . In this situation one of the processors in the system must operate at in order to meet the constraints. Further, if runs at then it is operating at the minimum frequency possible for meeting the deadline of . In this situation the deviation from the necessary frequency for maintaining task set feasibility has been reduced to zero. This processor is running at the optimal frequency for the current condition. Once the frequency of has been thus set it can be ignored for the remainder of the speed scaling algorithm. Likewise, since the work required by has been accounted for, this task is ignored as well. The speed scaling algorithm is then recursively called on the reduced set of processors and tasks, where . By induction, this speed scaling algorithm produces a minimum power system configuration achieving task set feasibility. This speed scaling algorithm will be implemented as a piece of the LTF-M algorithm developed by Chen et al. in [1].

5 Building the Scheduler The scheduler being proposed is composed of three components: the framework, the speed scaling algorithm, and a T-Ler plane scheduling algorithm. The framework is the mechanism which breaks the timeline into a series of T-Ler planes. It defines the boundaries of each plane and keeps track of which task needs to be executed in the plane. The framework provides the current task set to the speed scaling and T-Ler plane scheduling algorithms. The speed scaling algorithm, in turn, sets the operating frequency of each processor such that the system consumes the least power while maintaining a feasible hardware configuration. The task set, the length of the current plane, and the processor configuration are finally passed to the plane scheduling algorithm. The plane scheduling algorithm then performs its role of executing the tasks to meet the deadline on the current hardware configuration.

Fig. 2. Breakdown of the Scheduler

Power Efficient Scheduling for Hard Real-Time Systems

113

The framework based on Chen and Hsueh is overly constrained. It can only execute a subset of periodic tasks, it does not dynamically adjust to reduce power when a task completes early, it assumes that the scheduler can run in zero time, and it cannot handle shared resources, major hindrances to an actual implementation. This section describes the changes needed to allay these shortcomings. 5.1 Supporting More Task Types The first shortcoming of the framework proposed by Chen and Hsueh in [4] is that it only supports a highly restricted set of task types. The authors assumed that the period of every task equals its deadline, . This assumption is mutually exclusive with the ability to support sporadic and aperiodic tasks. The solution is to extend the framework so that it supports task sets where . The plane scheduler requires that every task’s period and deadline line up with a T-Ler plane boundary. The framework will set the plane boundaries accordingly. The downside of allowing the deadline and period to take on different values is that there may be more T-Ler planes within the hyperperiod. Naturally, more T-Ler planes directly correspond to more invocations of the scheduler. A sporadic task can be viewed as a specific type of periodic task. It has unique values and , an arrival time , and ∞. However the sporadic task cannot be injected directly into the current task set or deadline violations might ensue. This is because the current T-Ler plane has already been configured for the existing task set. Either the processor frequencies would require adjustment within the plane, or the sporadic task must wait to begin execution until the subsequent plane. The framework will wait for the subsequent plane. There are two reasons driving this decision. The first reason is the high execution cost required to adjust the schedule mid-plane. The second reason is to minimize the number of changes to the processor frequencies. This decision to wait can result in a large number of rejected sporadic tasks which might have been executable, if only they began executing earlier. This rejection rate can be reduced by having the framework insert extra plane boundaries into the system, thereby reducing the average length of the plane. The trade-off is an increased number of scheduler invocations. The sporadic task acceptance test for this scheduler is based on Theorem 2. If this rule holds true, while including

in , for all T-Ler planes in which the sporadic task executes (from the first boundary encountered after the sporadic task arrived, through the task’s deadline), then it can be accepted. With this the scheduler is capable of handling sporadic tasks while providing real-time guarantees. In a similar manner aperiodic tasks can be executed whenever the system has free resources. This means that provided , for all 1, … , remains true, 1 the scheduler can assign the aperiodic task a utilization of and schedule the task within the current T-Ler plane.

114

P.J. Nistler and J.-L. Gaudiot

5.2 Exploiting Hidden Speed-Ups The second shortcoming in Chen and Hsueh’s work is that the work allocation and execution is assumed to be P-fair. In order to enforce this P-fair model, each task would need to be heavily instrumented to monitor its progress. The solution is to use a weaker model which states that at all times the actual work executed is greater than or equal to the actual work allocation. This model gives the flexibility to run a task faster than the worst case execution time. This is important for implementation on a real system as there are many hidden factors which can improve the execution time. Several common examples are branch predictors, caches, and break statements nested inside of a loop. The weaker model allows the system to exploit these hidden speed-ups and does not require instrumentation of the code. This weaker fairness model is only used in the framework. It cannot be driven into the scheduling algorithm without risking a significant increase in complexity. When a task finishes its execution it informs the framework that it has completed. The framework will remove the task from the task set until its next period, if applicable. The system utilization for the new plane is: where is the utilization of the completed task. Inside of a given T-Ler plane the scheduling algorithm will not be aware that a task is executing faster than worst case. The tasks will be scheduled onto the processors according to their utilizations which at this point is dictated by the worst case execution time. The result is that the schedule for the plane can be determined at the beginning of the plane. This pre-determined schedule will be followed for the duration of the plane without any further invocations to the scheduling algorithm.. The framework can further minimize the system power by inserting extra plane boundaries into the hyperperiod. The extra plane boundaries reduce the delay from when the task finishes to when the hardware configuration will be adjusted, at the next invocation of the speed scaling algorithm. As the time between boundaries approaches zero, the tasks are removed as soon as they finish. This means that the scheduler never requests more work from the system than is absolutely necessary. The price for this power savings is the overhead of calling the scheduler more frequently. This price is not as dramatic as it first seems if the framework inserts boundaries to break up a larger T-Ler plane into identical length smaller T-Ler planes. In the case where no task finishes early the schedule produced for each plane will be identical. remain unchanged which guarantees the same result from the The task set and speed scaling and scheduling algorithm. The framework should recognize this and use the same schedule, saving significant execution time. 5.3 A Lightweight Scheduling Algorithm Recall that the speed scaling and scheduling algorithms will be invoked prior to the start of every T-Ler plane. The large execution complexity will become unwieldy as the number of planes increase. The LTF-M algorithm developed by Chen et al. in [1] is a lightweight algorithm which optimally sets the processor frequencies and schedules tasks, provided that all the tasks share a common start time and a common deadline. These requirements are met at the level of the T-Ler plane. The LTF-M algorithm

Power Efficient Scheduling for Hard Real-Time Systems

115

only has a complexity yet is still optimal both from the feasibility perspective and from a power perspective. 5.4 Shared Resources

local remaining execution requirement

The next improvement for the scheduler is sharing resources. Currently the scheduler requires that every task executes independently with no ordering between them. However, a resource often allows only one task to access it at a time. This section shall explain how the scheduler framework supports mutually exclusive, shared resources and introduces time based locks to enforce the ordering between tasks. Support for shared resources will be handled through mutex locks. These locks require a minor modification to the scheduler framework and do not violate any realtime guarantees. The change is most clearly explained with the help of Fig. 3.

T1

T

2

0

t

a

t

b t

t c

d time

t

f

Fig. 3. Shared Resources in the Fluid Schedule

The system is executing two tasks, and , where the shaded regions represent the time at which the shared resource is held by each task. The location and execution time of these critical sections must be provided to the framework. The framework will need to know the last possible moment at which the critical section may begin, and , and the last possible moment that the critical section may be exited, and . There will be no conflicts provided that or while the tasks execute according to their fluid schedules. However, within a T-Ler plane each task does not necessarily execute according to its fluid schedule. For example, may not begin execution in a given plane until 0. This delayed start may result in a critical section finish time ′ . In this case would enter its critical section at , violating the deadline guarantee. The simple solution is to force the tasks to acquire and release the resource at fixed points in time: e lock must be requested and released at the granularity of a whole TLer plane. This is accomplished by having the framework insert plane boundaries at the requesting edge of a critical section. Provided that in the original fluid schedule, setting only as a boundary ensures that ′ .

116

P.J. Nistler and J.-L. Gaudiot

As the framework prepares to pass the upcoming plane’s task set and deadline to the LTF-M algorithm it will perform a quick check. If there are multiple calls to a common lock, then the framework will insert a boundary at the beginning of the second request. Effectively the current plane will be shortened to run from the current time to the newly inserted boundary. (It is at this point that the framework can introduce extra, evenly spaced boundaries which can be used to reduce power and lower the sporadic task rejection rate.) The smaller, conflict-free plane will then be passed onto the LTF-M algorithm. This solution ensures that at no point can multiple tasks be given access to a single, shared resource. As mentioned in the last section a task may execute faster than its fluid schedule, resulting in a task requesting a resource early. There are two possible cases that might be encountered. The first case is when requests the resource while is holding it. The outcome is that will be forced to block its execution until after releases the lock. There is no problem here, because is not required to be executing yet. The lock naturally solved the problem.The second case is when executes significantly faster than such that it requests the lock before requests the lock. If is allowed to acquire the lock first, then the system can no longer guarantee the deadline for . In order to satisfy the hard real-time constraints, the system needs a method for selectively granting locks to the tasks. The task cannot employ the “compare and swap” instruction on the processor as the system would not be able to enforce the desired order. And, directing all lock requests through the operating system introduces overhead. The solution to the problem is using time based locks. This lock is constructed such that only the scheduler / operating system can set the value of a lock to a nonzero value. Moreover, the value of the lock will correspond to the task ID. In order to use the lock, the task will verify that its ID corresponds to the lock’s value. If the values match, then it proceeds with the assurance that it has sole possession of the lock. Otherwise, the task will be required to continually check the value of the lock until the value matches the task ID. When the task wishes to release the lock it will set the lock value to 0. Notice that once the task releases the lock it will not be able to reacquire the lock on its own. In the system described here, the framework shall be responsible for granting the locks. Because the framework has used the requesting edge of a resource as a T-Ler plane boundary, it can be shown that only one task will require a given resource within a plane. If multiple tasks require the same resource in a plane, then the tasks are not schedulable because . Therefore, if the tasks were originally implemented with no resource conflicts, the scheduler can set the lock value to the one task that requires the resource. This time based lock only works because the execution time of each critical section has been well defined. The critical code sections will be strictly ordered according to the task utilization values to ensure that all deadlines are met. The tasks can directly check the value of the lock with a read instruction and release the lock with a write instruction. Therefore, if the fluid task schedule does not show any resource conflicts then the system will function correctly.

Power Efficient Scheduling for Hard Real-Time Systems

117

5.5 Running the Scheduler This section shall focus on evaluating the execution requirement for the scheduler and how to invoke the scheduler while meeting the system’s requirements. The execution requirement of the scheduler can be described by the complexity of the framework and the LTF-M algorithms. With bounds on the computational complexity, the worst case execution time of the scheduler can be determined for a given task set. This is to say · where is jointly determined by the hardware implementation and the actual code implementing the scheduler. Using the previously derived complexity of the LTF-M algorithm and the complexity of the Framework algorithm the total scheduler complexity is lg lg where is the number of locks in the system. That means lg lg for each plane. This bound on the worst case execution time for the scheduler, allows the system to schedule the scheduler as a task. The only real-time constraint on the scheduler is that it finishes execution before the start of the T-Ler plane. The assumption hitherto has been that the scheduler ran between the end of the previous plane and start of the subsequent plane. However, the power consumption can be reduced by running the scheduler in the background. The system should set . By running the scheduler in the pre, vious plane it ensures that it completes the schedule before the next plane begins, while providing the largest possible deadline without overlapping the execution periods of any scheduler invocation. In this case the scheduler appears to be a recurring sporadic task. In each plane the task arrives with unique and for the plane. The average power consumption for running the scheduler in the previous plane is lower than running the scheduler between the planes. This can be shown by equations (3) and (4). The average power required for running the scheduler between planes is ,

. The power required for the scheduler itself is based on one processor operating at for time. The execution of both the task set and scheduler must happen within the original for the plane. However, to calculate the average power when running the scheduler in the previous plane requires the assumption that the work required to schedule the next plane is the same as the work required for the current plane. Setting the work requirements to be equal creates a fair comparison with as both cases execute the same , amount of work within the same overall timeframe. The power required for running the scheduler in the previous plane is 1 ,

.

The equation shows that in the ideal case the power consumed in the task set increases in the same fashion as adding a sporadic task to the system.

118

P.J. Nistler and J.-L. Gaudiot

When 0 the average power consumed by each approach is going to be moves towards its upper bound of both equations monotonidentical. As ically increase. The difference in the approaches lies in what the equations increase towards. Taking the limits of both equations lim lim

,

∞ 1

,

.

Since when 0 and they both monotonically , , . It is more power efficient to run the scheduler increase, , , inside the previous T-Ler plane than to run it between the planes.

6 Unique Solutions for Unique Problems The scheduler developed up to this point is the high-reliability scheduler. It optimizes the task set execution for the task’s worst case execution, assuming that the task always requires execution time during the period. Consequently, it sets the task to have a fluid schedule where / . The system adjusts the power reactively, only adjusting once certain that the task is performing faster. / In an ideal situation each task in the system would execute with , corresponding to the minimum speed required to finish the task by the deadline. The cannot be known in advance. The problem with this scenario is that the value , instead, which is assumed to be relatively solution is to use a predicted value , . The trade-off is that an incorrect value may cause the system to close to , consume more power than the high-reliability approach. This section will extend the previously developed high-reliability scheduler to proactively reduce system power. Ideally the scheduler would know the value , before the task began execution. But the reality is that this value cannot be known in advance with complete certainty. No prediction is completely accurate. Therefore, the scheduler must be capable of optimizing the task execution based on an inaccurate prediction. The scheduler cannot trust the inaccurate prediction without inheriting the potential risk of missing a deadline. The optimization must be made while ensuring that it is possible for to execute for by the deadline, . The task utiliza, tion does not need to be constant during execution, provided that on average . The resulting schedule is similar to the stochastic model depicted in , Fig. 1. If the task finishes execution early it will require less power than the WCET model. The scheduler must ensure that it is possible for the worst case execution to be completed exactly at the deadline. . At this point the scheduler is provided with a predicted execution time , for For the scheduler to set a task utilization it must generate a deadline , completing the predicted amount of execution. Using these two values the scheduler / , from time 0 to , . If the task does can execute with , not complete by , then it will set higher for the remaining execution.

Power Efficient Scheduling for Hard Real-Time Systems

119

The scheduler must make a correct decision when it selects , . The value should be large enough to minimize between time 0 to , while ensuring that / 1. It must rein the latter half of execution , , main possible for a processor to execute the remaining work. This results in an upper . The scheduler can also develop a bound for the value , , lower bound on the predicted deadline. The lower bound provided by the high/ . As long as the predicted deadline reliability design is , , meets these bounds the deadline can be met. Additionally when the , the system power is at least as low as the high-reliability design. , falls inside of its boundaries determines the reliabilThe location at which , ity of the system. At the lower bound the value , has no impact on the system. In this case the schedule is non-aggressive and is optimized for the worst case. The result is a schedule identical to the high-reliability design. Conversely when , has a significant impact on the system. is at its upper bound, the accuracy of , A correct prediction uses the least power possible while guaranteeing the deadline. Meanwhile an incorrect prediction will require more power than the high-reliability schedule. This approach is not without its penalty. The average power can be worse with an inaccurate predictor. Even more terrible, the instantaneous power consumption can spike. In the case where , , the utilization suddenly increases at , corresponding to an increase in the required power. However, if all the tasks , share a common deadline, every task is poorly predicted, and each task requires the complete worst case execution time , then the total system power will dramatically increase. The utilization of each task directly before the deadline would be 1. The power required to meet this demand is huge. More importantly it means that if then the system cannot guarantee task set feasibility when it aggressively selects . When the prediction is at the upper bound the scheduler is non-optimal, , reduced to . is This worst case feasibility scenario only happens when both the value , is selected aggressively. Conversely when inaccurate and the value , is correct or , is at its lower bound the system is optimal. Using this , bit of intuition the system designer can tailor the system for his needs. This commercial scheduler is not that different from the high-reliability scheduler. The only significant change is that the new design has the task execute with a variable utilization. The actual utilization value used is selected as a trade between potential power savings and system reliability. As the system is adjusted to reduce power more aggressively the guaranteed feasibility bound reduces from Theorem #1 to .

7 Conclusion and Future Work A power efficient, hard real-time scheduler for multiprocessor platforms is developed in this paper. The original goal to develop a system that would be useable in an actual system implementation imposes several constraints on the scheduler. It needs to support the spectrum of real-time tasks: periodic, sporadic, and aperiodic. The scheduling

120

P.J. Nistler and J.-L. Gaudiot

mechanism also needs to be scalable while supporting mutually exclusive, shared resources. In the end all of the goals are achieved while maintaining a low complexity of lg lg where is the number of tasks and is the number of mutually exclusive, shared resources in the system. The solution is also an optimal scheduler when the lower bound of , is chosen. In this case it optimizes the schedule for minimum power during the worst case execution scenario. However, the scheduler allows for more aggressive power reductions according to the needs of the system designer. The trade-off is that the feasibility bound is reduced. This scheduler is useful for an actual implementation. It takes into consideration that processor speeds should be changed sparingly, context switches are minimized, and it takes its own execution time into consideration. The future work is to implement and test this scheduler on an actual hardware platform and get feedback from industry’s real-time community. Acknowledgements. This work is partly supported by the National Science Foundation under Grant No. CCF-0541403. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References 1. Chen, J., Hsu, H., Chuang, K., Yang, C., Pang, A., Kuo, T.: Multiprocessor EnergyEfficient Scheduling with Task Migration Considerations. In: 16th Euromicro Conference on Real-Time Systems, pp. 101–108. IEEE Computer Society, Washington (2004) 2. Gruian, F.: Hard Real-Time Scheduling for Low-Energy Using Stochastic Data and DVS Processors. In: 16th International Symposium on Low Power Electronics and Design, pp. 46–51. ACM, New York (2001) 3. Malani, P., Mukre, P., Qiu, Q., Wu, Q.: Adaptive Scheduling and Voltage Scaling for Multiprocessor Real-Time Applications with Non-Deterministic Workload. In: DATE 2008: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 652–657. ACM, New York (2008) 4. Chen, S., Hsueh, C.: Optimal Dynamic-Priority Real-Time Scheduling Algorithms for Uniform Multiprocessors. In: Proceedings of the 2008 Real-Time Systems Symposium, pp. 147–156. IEEE, Washington (2008) 5. Funk, S., Goossens, J., Baruah, S.: On-line Scheduling on Uniform Multiprocessors. In: Proceedings of the 2001 Real-Time Systems Symposium, pp. 183–192. University of North Carolina at Chapel Hill, North Carolina (2001)

Storage Device Performance Prediction with Selective Bagging Classification and Regression Tree Lei Zhang1 , Guiquan Liu1 , Xuechen Zhang2 , Song Jiang2 , and Enhong Chen1 1

University of Science and Technology of China 2 Wayne State University [email protected], [email protected], {xczhang,sjiang}@eng.wayne.edu, [email protected]

Abstract. Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment and conﬁguration. Based on bagging ensemble, we proposed an algorithm named selective bagging classiﬁcation and regression tree (SBCART) to model storage device performance. In addition, we consider the caching eﬀect as a feature in workload characterization. Experiments indicate that caching eﬀect added in feature vector can substantially improve prediction accuracy and SBCART is more precise and more stable compared to CART. Keywords: Performance prediction, Storage device modeling, CART, Ensemble learning, Bagging.

1

Introduction

Today’s high-end storage systems are highly complex and conﬁgurable, and the automation of storage management is a critical research challenge. One key issue in the automation of storage management is the placement of data sets onto various devices in the storage system, or how to map a workload of speciﬁc characteristics onto a appropriate storage device for high service quality and system utilization. To this end, the system must be able to predict the performance a device can provide in its service of a particular workload. Performance prediction for storage systems have long been studied. Among them are three methods which are particularly useful and eﬃcient. They are analytic device modeling, simulation and emulation, and black-box modeling. It is a cumbersome task to build accurate analytic models for disk drives because of their nonlinear and state dependent behavior. Ruemmler and Wilkes [1] developed analytic disk models that take into account disk head positioning, platter rotation, and data caching and read-ahead. The model is further improved by Worthington et al. [2], resulting in a widely used disk simulator, DiskSim [3], which represents the state of the art in disk simulation. DiskSim simulates almost C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 121–133, 2010. c IFIP International Federation for Information Processing 2010

122

L. Zhang et al.

all performance-relevant components of a disk, including device drivers, buses, controllers, adapters, and caches. Emulators go one step further than simulators. In addition to modeling performance, they can interoperate with real systems. For example, MEMS devices can interact with existing system [4]. While disk arrays are widely used in the high-end storage systems, a lot of research work focuses on the modeling and simulation of disk arrays [5–7]. Among the work, the Pantheon storage system simulator [7, 8] was built to support the rapid exploration of design choices of storage systems in HP AutoRAID advanced disk array technology [9] and TicherTAIP parallel RAID architecture [10]. Uysal et al. developed a composite analytic model of mid-range disk array and reported its accuracy within 15% of actual measurements [6]. Compared with simulations and emulations, analytic models are much faster. However, they cannot capture as many details as simulators and emulators. Both methods rely heavily on human expertise on the targeted system and thus are called white-box approach. Given suﬃcient time and expertise, the white-box approach can work well in exploring design space for a particular device. Unfortunately, such time and expertise is not always available for high-end storage systems because the systems are usually complex and opaque. In addition, some information, such as patented use of algorithms and optimizations, is not disclosed. Furthermore, the technical trend towards storage consolidation in large data centers hints that building an accurate model or simulator using white box method cannot be a general solution in serving a variety of very diﬀerent workloads. In contrast, the so-called black box approach treats the storage system as a black box without knowing the internal components or algorithms and can accommodate workloads of diﬀerent characteristics. In this approach, the training data sets, which contain quantiﬁed description of characteristics of input I/O requests and their corresponding response times from the system, are recorded in a table [12] and fed into a statistic model [13] , or a machine learning model [14, 15]. Wang et al. [15] proposed to use classiﬁcation and regression tree (CART) method as a black-box model for performance prediction as it is easy to ﬁt for diﬀerent workloads, has good interpretability, and provides good approximations to highly nonlinear mappings. However, the CART model has its critical drawback – it is not stable, in the sense that a small change of the data set can lead to a drastic change of the result (more details will be described in section 2.1). This paper addresses the issue by using ensemble algorithms to improve and enhance the accuracy and stability of the basic CART model. In this paper, we propose the selective bagging CART (SBCART) model, in which we modify the bagging algorithm for regression ensemble. In the model built on top of the CART model, we ﬁrst train N models by bagging and then select n representative models from N models, where n < N . Compared with the CART model, SBCART can provide more precise and more stable performance predictions for modeled storage devices. In addition, an important measure missing in the feature vector designed by Wang [15] is about caching eﬀect, which

Storage Device Performance Prediction with SBCART

123

Fig. 1. (a) Regression tree constructed with the Financial I/O trace containing 60 instances; and (b) regression tree of 56 instances, which is produced by removing four randomly selected instances from the tree in (a). The shaded nodes represent leaf nodes containing the predicted values. The Nmin is set to 25. More details on the Financial trace can be found in Section 3.

makes a substantial diﬀerence on prediction accuracy. We include the measure in the vector to make good predictions with the SBCART model. The remainder of this paper is organized as follows. Section 2 describes the SBCART model. Section 3 presents our experimental results, and Section 4 concludes the paper.

2 2.1

The SBCART Models The CART Model

CART [16] is a nonparametric model which uses historical data to construct so-called decision trees. Trees are built top-down recursively beginning with a root node. At each step in the recursion, the CART algorithm determines which predictor variable and its value in the training data best split the current node into child nodes. The best split should minimize the diﬀerence among the instances in the child nodes. In other words, a good split produces child nodes with instances that contain similar values. Trees are grown to be excessively large with hundreds of levels if there is no pruning step. Furthermore, a tree of too large size may cause over-ﬁtting, in the sense that it may perform poorly in predicting independent data. There are two pruning algorithms: optimization by minimum number and cross-validation. In the ﬁrst pruning algorithm, splitting is stopped when the number of instances in the node is fewer than predeﬁned required minimum Nmin . This approach is eﬃcient, easy to apply and can produce consistent results. However, it requires the calibration of new parameter Nmin . In the second pruning algorithm, the procedure of cross validation is mainly based on the optimal proportion between

124

L. Zhang et al.

the complexity of the tree and the misprediction error. With the increase of tree size, the misprediction error is decreasing and reaches 0 when the tree grows into maximum tree. Unfortunately, it usually generates complex decision trees that perform poorly on independent data. Therefore, a critical operation is to ﬁnd the optimal proportion between the tree complexity and misclassiﬁcation error. Cross-validation does not require adjustment of any parameters. However, it is expensive to apply this pruning algorithm. Once the tree is built, an instance can travel the pruned tree to make a prediction. At each tree node, either left branch or right branch is taken according to the outcome of comparison of the instance with the split variable and its value of the node. Finally, the instance reaches the leaf node whose value will be the predicted value. However, CART may produce unstable decision trees. Insigniﬁcant modiﬁcation of learning instances, such as elimination of a few instances or changing split variables and values, could lead to radical changes in decision trees. As Figure 1 shows, the decision tree constructed with 60 instances in Figure 1(a) is very diﬀerent from the tree in Figure 1(b), where only four randomly selected instances are removed. 2.2

Ensemble Learning

The goal of the ensemble learning method is to construct a collection (an ensemble) of individual models to improve the accuracy and performance of a single model. Many researchers have demonstrated signiﬁcant performance improvements through ensemble methods [17, 18]. Figure 2(a) shows the basic steps involved in the training of a series of models with training data and in the using of voting strategy to predict new data samples. Two of popular techniques for constructing ensembles are bagging [19] and the adaboost family of algorithms [20]. Both methods invoke a base learning algorithm many times with diﬀerent training sets. In bagging, a training set is derived by forming a bootstrap replica of the original training set, and each training record has the same weight. Compared with the bagging, the adaboost algorithm maintains a set of weights over the original training set and adjusts these weights after each model is trained with a base learning algorithm. The adjustments increase the weights of examples which are poorly predicted by the base learning algorithm and decrease the weights of examples which are well predicted. Bagging generates diverse classiﬁers or models only if the base learning algorithm is unstable, that is, small changes to the training set lead to signiﬁcant changes of the learned classiﬁer or model. Bagging can be viewed as an approach of improving prediction accuracy by exploiting the instability, because the composite model can eﬃciently reduce the variation of individual models. Adaboost requires less instability than bagging, because it can make much larger changes in the training set. As CART is instable, we propose to use the ensemble of bagging to improve the prediction accuracy of individual CART models.

Storage Device Performance Prediction with SBCART

125

Fig. 2. (a) Using ensemble learning to improve the accuracy of individual basic CART models; and (b) using selective ensemble learning to improve the accuracy of a basic model. In (a), bagging or adaboost can construct a series of models M1 , M2 , ..., Mk and then predict the unknown samples by using voting strategy; In (b), we ﬁrst construct k models by bagging or adaboost, then select some representative models from k models, and use voting strategy to predict new samples.

2.3

The SBCART Method

Bagging is one of the widely used ensemble learning algorithms. Each training set is constructed by forming a bootstrap replica of the original training set. Thus, some samples in the original training set may appear many times in bootstrap data set while other samples may not appear. Prior research indicates that bagging can substantially improve the eﬀectiveness of the unstable basic learning models [19]. However, as the number of ensemble models increases, the space and time cost will increase linearly. Many methods have been proposed to address the issue by using diﬀerent classiﬁcation methods [21–23]. Zhou et al. proposed a method to select a portion of a whole model tree and use genetic algorithms to prune the scale [21]. Bakker et al. proposed to cluster all models and then select representative models in each class to prune whole models [22]. Martinez-munoz et al. proposed a method to prune trees in the ordered bagging ensembles [23]. However, those pruning methods are pretty complicated. In contrast, our SBCART algorithm is proposed to solve the problem on regression and the pruning method is simple. We adopt CART as the basic model. First, we create k models by bagging. Second, we sort the k models by median relative error on the training set. Finally, we select the ﬁrst 20%-50% of whole models to prune the scale. Figure 2(b) shows the basic steps involved in the training of a series of models on the training data, the selection of a part of whole models, and the using of voting strategy to predict new data samples based on the pruned models. Compared with ﬁgure 2(a), ﬁgure 2(b) adds a selective (pruning) function to prune the k models. The SBCART algorithm is described as follows using pseudo code.

126

L. Zhang et al.

Input: D: the dataset containing d samples; M: CART(the basic model); k: the number of models; s: the number of pruned models Output: Pruned models M ∗ ; Training Phase: (1)for i=1 to k do //Bagging (2) Sampling with replacement, yield Di (remove duplicated instances); (3) Create the model Mi based on Di ; d |M (X )−y | (4) Compute error(Mi ) (error(Mi ) = d1 j=1 i yjj j ); //median relative error (5)end for Pruning Phase: (1)Order k models by error(Mi ) in ascending order; (2)Get the first 20%-50% of the ordered models Prediction Phase: Using the pruned models to predict testing data X (1)for i=1 to s do i )) (2) Wi = log( (1−error(M + 1); //assign weight for each model error(Mi ) (3) Vi = Mi (X); //predicted values (4)end for (5)Normalize n the Wi ; (6)return i=1 Wi ∗ Vi In the algorithm, we remove duplicated instances in Di while sampling with replacement in training phase, because the duplicated instances in the training set can have a negative eﬀect on choosing the best split variable and lead to signiﬁcant changes of the structure of the tree. In our experiments, the size of the data set is reduced by half (5000 instances to 2500 instances) and the time of tree construction is reduced from 7 seconds to 1.25 seconds after deleting duplicated instances in Di . In the algorithm, the weight function in the prediction i )) + 1), this can guarantee the weight to be posiphase is set to log( (1−error(M error(Mi ) tive. Furthermore, in order to get enough representative models, we choose small pruning proportion (20% for example) if the k is large, and choose large proportion (50% for example) if the k is small. Compared with unpruned bagging ensembles of CART model, the SBCART has big advantages in both space cost and computation time because the scale of models is reduced to about 20%-50% of the whole model.

Storage Device Performance Prediction with SBCART

127

Fig. 3. (a) Training a SBCART model based on observed response times; and (b) using the model to predict response times

2.4

Predicting Performance with SBCART

Our goal is to build a model for a given storage device to predict device performance as a function of I/O workload. We use the UMass traces [24] which deﬁne a workload as a sequence of I/O requests. Each request Ri is characterized with ﬁve attributes: application speciﬁc unit (ASU), logical block address (LBA), size (SIZE), opcode (OPCODE), and timestamp (TIMESTAMP). The ASU is a positive integer representing the application speciﬁc unit; The LBA ﬁeld is a positive integer that describes the ASU block oﬀset of the requested data; The SIZE ﬁeld is a positive integer that describes the number of requested bytes, where the size of a block is contained in the description of the trace ﬁle; The OPCODE ﬁeld is a single, case insensitive character that deﬁnes the direction of the transfer, R or r indicates a read operation, W or w indicates a write operation; The TIMESTAMP ﬁeld is a positive real number representing the oﬀset in seconds for this I/O request from the beginning of the trace. Our approach uses SBCART to approximate the function. We assume that the model construction algorithm can take any workloads on a device for model training. Figure 3 shows the basic steps involved in the training of a model based on the observed response times and using the model to predict system response, which is per-request response time in this study. Model construction does not require any information about the internals of the modeled device. Therefore, the methodology is generally enough to model any device. We compared our SBCART model with CART model in Table 1. Various aspects are listed, including prediction error, stability, interpretability, robustness to outliers, ability to handle irrelevant input, model construction time, and prediction time. We list these aspects in the order of their importance to the storage performance prediction. Good stability indicates that a small change of the training data set cannot lead to signiﬁcant change of the prediction results. Interpretability describes a model’s ability to infer the importance of input

128

L. Zhang et al.

Table 1. Comparison between SBCART and CART models when they are used to predict per-request response time

variables. Robustness describes a model’s ability to respond to noisy data sets. Irrelevant input refers to features that have little predictive value. We only compare SBCART with CART, and the comparison of other regression methods can be found in [15]. The two models are constructed using the ﬁrst 5000 instances of Financial user4 trace and run on another 5000 instances of the same trace for testing (More details on the trace can be found in Section 3). The parameter k in SBCART is set to 20 and Nmin in CART is set to 10. As shown in Table 1, the prediction error (median relative error) of SBCART is lower and the stability is better compared to CART, as the composite models can reduce variance of individual models. The construction of SBCART model takes a longer time period and the space overhead of this model is higher, because SBCART needs to build k diﬀerent models. However, the higher costs are well affordable in the systems for storage device performance prediction. Furthermore, the model construction time for SBCART can be reduced with parallel execution as each bootstrap modeling is independent. Overall, the SBCART method proposed for storage device performance prediction is more stable and more precise than CART.

3 3.1

Experiments Request Feature Vector

Our request Feature Vector (FV) for Ri contains the following variables: Request Vector Ri = [T imeDif fi(1), ..., T imeDif fi(k), LBNi , LBN Dif fi (1), ..., LBN Dif fi (m), Sizei , RWi , Seq(i), Hit(i)] where T imeDif fi(l) = T imeStampiT imeStampi−l (l = 1, 2, ..., k), LBN Dif fi (k)=LBNi -LBNi−k (k = 1, 2, ..., m); The ﬁrst k variables measure the temporal burstiness of the workload when Ri arrives. The next m + 1 variables measure the spatial locality in terms of the distance of two continuous requests. Seq(i) indicates whether the request is a sequential access; Sizei and RWi is related to the data transfer time. Hit(i) indicates whether a request is hit in the cache.

Storage Device Performance Prediction with SBCART

3.2

129

Devices and Traces

We use DiskSim [3] to simulate a disk (Seagate ST32171W) of 7200RPM. We replay all the traces on the device to obtain the training data set. We use the UMass traces [24] consisting of Financial traces and WebSearch traces. The Financial traces are from OLTP applications at two large ﬁnancial institutions (relatively more sequential) and the WebSearch traces are from a popular search engine (relatively more random). There are several ﬁelds in the record for a request in UMass trace ﬁle. The ﬁrst ﬁeld is the ASU, which is related to application. In our experiments, we assume that one user runs one application on the server. Therefore, ASU number can be considered as a user ID. We randomly chose two ASU numbers and ﬁltered out all the requests for each of these ASUs, respectively. Accordingly, we obtained WebSearch-user1 and WebSearch-user2 traces from WebSearch1.spc, Financialuser2 and Financial-user4 traces from Financial1.spc. We built our models based on those traces. 3.3

Evaluation Methods

For evaluation, we use the trained device models to predict response time for ˆ a single request. We deﬁne the relative prediction error as |Y Y−Y | to show the accuracy of diﬀerent modeling algorithms. We also show the average, 90th, 80th, and 70th percentile relative errors of response time for diﬀerent data sets. Based on the above four users’ traces, we trained four models: M odelF in2 , M odelF in4 , M odelW eb1 , M odelW eb2 respectively. One hundred thousand requests are obtained for each user from the original trace and half of the requests are used for training while half of them are used for testing. In our experiments, k in T imeDif fi(k) is set to 3 and m in LBN Dif fi (m) is set to 5. The k in SBCART is set to 20, the pruning proportion is set to 50% and the Nmin of CART is set to 10. 3.4

Experiment Results

Figure 4 compares the median relative errors of the two models (SBCART and CART) in modeling the Seagate ST32171W disk on Financial (M odelF in2 , M odelF in4 ) and WebSearch (M odelweb1 , M odelweb2 ) traces, respectively. Overall, the SBCART-based device models provide better prediction accuracy in predicting the average, 90th, 80th and 70th percentile response times than CART. We can make several observations from the experiment results. First, feature vector must be designed to include all relevant measures. An important measure missing in the feature vector designed by Wang et al. [15] is about caching eﬀect, which makes a substantial diﬀerence on prediction accuracy. As hitting in the buﬀer cache is basically determined by temporal locality of accessed blocks [25], we propose to maintain an approximate LRU stack to eﬃciently track recency of requested blocks and use it as a measure in the vector. As shown in Figure 4(a), CART-cache can reduce the error from 25.12%, 25.04%,

130

L. Zhang et al.

Fig. 4. Comparison of SBCART and CART on four traces: M odelF in2 , M odelF in4 , M odelW eb1 , M odelW eb2 . CART-nocache shows that cache information (Hit) is not considered in feature vector and CART-cache shows that the cache information is used as a measure in feature vector.

91.64%, 23.18% to 15.01%, 14.15%, 15.99%, 13.48% on M odelW eb1 , M odelW eb2 , M odelF in2 , M odelF in4 , respectively. We can see that the median relative error is reduced by about 10% on M odelW eb1 , M odelW eb2 , M odelF in4 , and by about 75% on M odelF in2 . We also observed that the traces of M odelW eb1 , M odelW eb2 , and M odelF in4 are relatively more random and the trace of M odelF in2 is relatively more sequential, Therefore, sequential workloads like Financial-user2 are more sensitive to the caching eﬀect, and addition of the cache information can greatly reduce the prediction error. Second, SBCART can improve the accuracy and stability of CART. As shown in Figure 4(a), the SBCART-nocache can improve prediction accuracy by about 5% compared to CART-nocache, and the SBCART-cache can improve the accuracy by about 3% compared to CART-cache. We can see that by using the measure of Hit in the feature vector and the ensemble method, the prediction accuracy can be increased by about 13% for the relatively random workloads and about 70% for the relatively sequential workloads. As shown in Figure 5, the SBCRAT is more stable than CART when the training data set changes, because selective ensemble models can reduce the variance of individual models.

Storage Device Performance Prediction with SBCART

131

Fig. 5. Comparison of stability between CART and SBCART. X-axis shows the number of training records missing from the ﬁrst 5000 instances of WebSearch-user1 trace and its testing data from another 5000 instances from the same trace.

Finally, it is more diﬃcult to predict response times at high percentiles. As shown in Figures 4(b), (c) and (d), the median relative errors are reduced by about 5%, 8% and 11%, respectively, compared to Figure 4(a). We can observe that SBCART can consistently produce more precise predictions than CART. In summary, the SBCART model as well as the workload characterization (feature vector) used in the modeling can produce more accurate predictions and is more stable than the CART model.

4

Conclusions

Storage device performance modeling is an important component in self-managed storage systems, especially in high-end storage systems. Our SBCART model takes a workload as input and predicts its performance on the modeled device eﬃciently and accurately compared to the CART model. Based on bagging algorithms, we proposed a selective bagging classiﬁcation and regression tree (SBCART) model using the basic model CART model. Our experiment results show that the SBCART model as well as the workload characterization (feature vector) used in the modeling can produce more accurate predictions and is more stable than the CART model. Acknowledgements. The authors gratefully acknowledge the support of the Fundamental Research Funds for the Central Universities, the National Natural Science Foundation of China (No.60833004 and No. 60775037), the National High Technology Research and Development Program of China (863 Program,No.2009 AA01Z123), and Specialized Research Fund for the Doctoral Program of Higher Education (No.20093402110017). This research was also partially supported by US National Science Foundation under grant CAREER CCF 0845711.

132

L. Zhang et al.

References 1. Ruemmler, C., Wilkes, J.: An introduction to disk drive modeling. IEEE Computer 27(3), 17–18 (1994) 2. Worthington, B., Ganger, G., Patt, Y.: Scheduling algorithms for modern disk drives. In: Proc. of the ACM SIGMETRICS Conference, vol. 22, pp. 241–251. ACM, New York (1994) 3. The DiskSim Simulation Environment (v3.0), Parallel Data Lab, http://www.pdl.cmu.edu/DiskSim/ 4. Griﬃn, J.L., Schindler, J., Schlosser, S.W., Bucy, J.S., Ganger, G.R.: Timingaccurate storage emulation. In: FAST 2002 on File and Storage Technologies, pp. 75–88. USENIX Assoc., Monterey (2002) 5. Barve, R., Shriver, R., Gibbons, P.B., Hillyer, B.K., Matias, B.K., Vitter, J.S.: Modeling and optimizing i/o throughput of multiple disks on a bus. In: ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pp. 83–92. ACM, New York (1999) 6. Uysal, M., Alvarez, M., Merchant, A.: A modular, analytical throughput model for modern disk arrays. In: 9th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems Conference, MASCOTS, Cinncinnati, pp. 183–192 (2001) 7. Wilkes, J.: The Pantheon storage-system simulator. Technical Report HPL-SSP95-14, Storage Systems Program, Hewlett-Packard Laboratories (1996) 8. Aicheler, U.: A visual user interface for the pantheon storage system simulator. Technical Report HPLSSP961, Storage Systems Program, Hewlett-Packard Laboratories (1996) 9. Wilkes, J., Golding, R., Staelin, C., Sullivan, T.: The HP AutoRAID hierarchical storage system. ACM Transactions on Computer Systems 14(1), 108–136 (1996) 10. Cao, P., Lim, S.B., Venkataraman, S., Wilkes, J.: The TickerTAIP parallel RAID architecture. ACM Transactions on Computer Systems 12(3), 236–269 (1994) 11. Schindler, J., Ganger, G.R.: Automated disk drive characterization. CMU SCS Technical Report CMU-CS-99-176 (1999) 12. Andenson, E.: Simple table-based modeling of storage devices. Technical Report HPL-SSP-2001-04, HP Laboratories (2001) 13. Kelly, T., Cohen, I., Goldszmidt, M., Keeton, K.: Inducing models of black-box storage arrays. Technical Report HPL-SSP-2004-108, HP Laboratories (2004) 14. Mesnier, M.P., Wachs, M., Sambasivan, R.R., Zheng, A.X., Ganger, G.R.: Modeling the relative ﬁtness of storage. In: Joint International Conference on Measurement and Modeling of Computer Systems. ACM, New York (2007) 15. Wang, M., Au, K., Ailamaki, A., Brockwell, A., Faloutsos, C., Ganger, G.R.: Storage device performance prediction with cart models. In: 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS, USA (2004) 16. Breiman, L., Friedman, J., Stone, C.J., A., O.R.: Classiﬁcation and regression trees. Chapman and Hall CRC, Boca Raton (1984) 17. Kohavi, R., Kunz, C.: Option decision trees with majority votes. In: 14th International Conference on Machine Learning, Morgan Kaufman, San Francisco (1997) 18. Bauer, E., Kohavi, R.: An empirical comparison of voting classiﬁcation algorithms: Bagging, boosting, and variants. Machine Learning 36(1-2), 105–139 (1999) 19. Breiman, L.: Bagging predictors. Machine learning 24(1), 123–140 (1996)

Storage Device Performance Prediction with SBCART

133

20. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: 13th International Conference on Machine Learning. Morgan Kaufmann, San Francisco (1996) 21. Zhou, Z.H., Tang, W.: Ensembling neural networks: Many could be better than all. Artiﬁcial Intelligence 137(1-2), 239–263 (2003) 22. Bakker, B., Heskes, T.: Clustering ensembles of neural networks. Neural Networks 16(2), 261–269 (2003) 23. Mart nez2mu noz, G., Su rez, A.: Pruning in ordered bagging ensembles. In: 23th International Conference on Machine Learning, pp. 1266–1273. IEEE, Piscataway (2006) 24. Umass trace repository, http://traces.cs.umass.edu/index.php/Storage/Storage 25. Jiang, S., Zhang, X.: LIRS: an eﬀcient low inter-reference recency set replacement to improve buﬀer cache performance. In: ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pp. 31–42. ACM, New York (2002)

Embedding Algorithms for Bubble-Sort, Macro-star, and Transposition Graphs HyeongOk Lee1, Hyun Sim1, JungHyun Seo2, and Mihye Kim3,* 1

Department of Computer Education & Engineering, Sunchon National University, 413 Jungangno Suncheon Chonnam, South Korea {oklee,simhyun}@scnu.ac.kr 2 Research & Development Team, Jeonnam Technopark, 315 Maegokdong Suncheon Chonnam, South Korea [email protected] 3 Department of Computer Science Education, Catholic University of Daegu, 330 Hayangeup Gyeonsansi Gyeongbuk, South Korea [email protected]

Abstract. Bubble-sort, macro-star, and transposition graphs are interconnection networks with the advantages of star graphs in terms of improving the network cost of a hypercube. These graphs contain a star graph as their sub-graph, and have node symmetry, maximum fault tolerance, and recursive partition properties. This study proposes embedding methods for these graphs based on graph definitions, and shows that a bubble-sort graph Bn can be embedded in a transposition graph Tn with dilation 1 and expansion 1. In contrast, a macro-star graph MS(2, n) can be embedded in a transposition graph with dilation n, but with an average dilation of 2 or under. Keywords: Interconnection network, Embedding, Dilation.

1 Introduction Applications in engineering and scientific fields such as artificial intelligence, CAD/CAM (Computer-Aided Design and Computer-Aided Manufacturing), and fluid mechanics require hundreds of operations for data processing, which has led to increased interest in high-performance computers with a large number of processors. Consequently, interest in parallel processing, in which more than one processor simultaneously executes multiple tasks or a part of one program, has risen dramatically. However, it has proven challenging to design effective parallel algorithms. Parallel algorithms are usually designed for a particular parallel computer architecture, so to optimize algorithm design it is necessary to understand parallel computer architectures. There are a number of major architectures, and several methods exist for classifying them. One of the most well-known methods is Flynn’s taxonomy, which categorizes architectures into four groups based on the number of instruction and data streams available in the architecture. Of the four types, MIMD (Multiple Instruction, *

Corresponding author.

C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 134–143, 2010. © IFIP International Federation for Information Processing 2010

Embedding Algorithms for Bubble-Sort, Macro-star, and Transposition Graphs

135

Multiple Data streams) is the most common and it can simultaneously execute multiple instruction and data streams within one system. MIMD parallel computers can be classified into two types based on memory architecture: multi-processor systems with shared memory and multi-computer systems with distributed memory. In a multi-computer system, each process has its own memory and is connected to other processors via an interconnection network. In a multi-computer system, each process has its own memory and is connected to other processors via an interconnection network. Inter-processor communication is achieved by sending messages among computers through the network, and a data-driven system is used for computations [1], [2]. The performance of the multicomputer system depends on the performance of each processor as well as the structure of the interconnection network and the applied algorithms. The characteristics of the interconnection network greatly influence overall performance and scalability of a multi-computer system. The most well-known topologies of interconnection networks are the mesh, hypercube, and star graph. The commonly used parameters for evaluating the performance of interconnection networks are degree, diameter, symmetry, scalability, fault tolerance, and embedding. The embedding of interconnection networks is intended to analyze the interrelationship between graphs to observe whether a certain graph G is included in or interrelated with another graph H. The evaluation of embedding is significant: if graph G can be efficiently embedded in graph H with less cost, then the method developed in the interconnection network with graph G can be used in the interconnection network with graph H at less cost [3], [4]. In previous works [5], [6], we analyzed embedding methods for some star variations such as star, matrix-star, Rotator-Faber-Moore, and pancake graphs. In this paper we analyzed embedding methods for bubble-sort, macro-star, and transposition graphs, which are well-known as variations of the star graph. This work extended earlier work to develop a method for embedding between bubble-sort and transposition graphs [7].

2 Related Work An interconnection network can be represented as an undirected graph G = (V, E), with each processor presented as a node (vertex) v of G, and the communication channel between those processors presented as an edge (v, w). V(G) and E(G) represent the set of nodes and edges of graph G, respectively. That is, V(G) = {0, 1, 2, …, n–1} and E(G) consists of pairs of distinct nodes from V(G). There exists an edge (v, w) between two nodes v and w of G if and only if a communication channel between v and w exists [8]. If we classify the interconnection networks proposed up to now, we can divide them into the mesh variation with n×k nodes [9], the hypercube variation with 2n nodes [2], [3], [10], the star graph variation with n! nodes [8], and the odd graph variation with combination 2nCn nodes [4]. A variation of the star graph represents nodes using n distinct symbols, and the number of nodes is approximately n! nodes. Star [8], [11], bubble-sort [12], pancake [11], transposition [13], macro-star [14], rotator [15], and Faber-Moore [16] graphs have been proposed as variations of the star graph. The graphs have a smaller node degree and diameter than a hypercube with a similar number of nodes.

136

H. Lee et al.

A macro-star graph MS [14] is an interconnection network that improves the network cost of a star graph by generalizing the star graph. The size and degree of MS are determined by parameters l and n. A macro-star graph MS(l, n) has (nl+1)! nodes, (n+l-1) degree, and (nl+1)! * (n+l-1) edges. The address of each node is represented as a permutation of k (=nl+1) distinct symbols. In other words, a node corresponds to a permutation. An edge exists between nodes u and v in MS(l, n) if and only if the permutation of node v can be obtained from that of node u by applying each of, as defined below, two edge generators Tj and Si where 2 ≤ j ≤ n+1 and 2 ≤ i ≤ l. A macrostar graph MS(l, n) can be defined as shown in Eq. (1), where k distinct symbols = {1, 2, .., k}, and a permutation of , U = u1:k = u1u2...ui...uk, ui .

∈

∈ ∈

V(MS(l,n))={U=u1:k│ui,uj , ui≠uj, i≠j, 1≤i,j≤k}, E(MS(l,n))={(U,V)│U,V V(MS(l, n)) satisfying U=Tj(V) or U=Si(V), 2≤j≤n+1, 2≤i≤l}. (1) Two edge generators Tj and Sn,i are defined in the macro-star graph MS(l, n) to formulate a link (edge) from a node to another node. The edge generator Tj is defined to create a permutation by interchanging the first symbol u1 with the jth symbol (uj) of a given node. With a given node U = u1:k = u1u2...ui...uk, the permutation of the node generated by Tj will be Tj(U) = uju2:j-1u1uj+1:k. Another edge generator Sn,i is defined to create a permutation by interchanging the sequence of symbols u(i-1)n+2:in+1 with the sequence of symbols u2:n+1 in a given node. With a given node U = u1:k, the permutation of the node generated by edge Sn,i will be Sn,i(u1:k) = u1u(i-1)n+2:in+1un+2:(i1)n+1u2:n+1uin+2:k. The edge generator Sn,i is represented simply as Si. Here, the symbol sequence u(i-1)n+2:in+1 is referred to as a cluster [14]. Fig. 1a shows a top view of an MS(2, 2) graph, while Fig. 1b presents the details of the level 2 cluster ‘23’. Each circle corresponds to a cluster. The smaller circles in the internal domain of the inclusive circle in Fig. 1b are nodes whose second clusters consist of ‘23’; that is, the full permutation of node 145 is 14523, that of 541 is 54123, and so on. The permutation generated by the edge generator T2 is 41523 in node 14523; that is, T2(14523) = 41523 and T3(14523) = 54123. When the edge generators Tj and Si are sequentially applied to the permutation of a certain node U, they are represented as Si(Tj(U)), and simply SiTj(U). For example, the sequence of generators S2(T2(14523)) will create the permutation 42315 in 14523. First, T2 generates 41523 and then S2 provides 42315 [5]. a)

12

21 31 41

13

14

15

23

24

25

32 42

34 43

541

35

52

53

54

23

451

45

T2

145

41

51 51

45

b)

15 415

S2 514

154

54

Fig. 1. Example of a macro-star graph MS(2, 2)

14

Embedding Algorithms for Bubble-Sort, Macro-star, and Transposition Graphs

137

An n-dimensional bubble-sort graph Bn [12] consists of n! nodes and n(n–1)!/2 edges. The address of each node is represented as a permutation of n symbols {1, 2, 3, ..., n}. An edge exists between two arbitrary nodes v and w if and only if the corresponding permutation of the node w can be obtained from that of v by interchanging two adjacent symbols v and w in the permutation. The bubble-sort graph Bn can thus be defined as shown in Eq. (2), where n distinct symbol sets = {1, 2, .., n}, and a permutation of , B = b1b2...bn, bi ∈ . As the number of i- dimensional edges adjacent to B is equal to n–1, the bubble-sort graph Bn is a regular graph of degree n–1 and has a diameter of n(n–1)/2. It is also a hierarchical interconnection network because it can partition the graph with the edge as the center. It is node- and edge-symmetric as well as bipartite and includes Hamiltonian cycles.

∈

V(Bn) = {(b1b2...bn)│bi , i≠j, bi≠bj} E(Bn) = {(b1b2...bibi+1...bn)(b1b2...bi+1bi...bn)│(b1b2...bi...bn)

∈V(B ), 1≤i≤n–1}. n

(2)

An n-dimensional transposition graph Tn [13] consists of n! nodes and n(n-1)n!/4 edges. The address of each node is represented as a permutation of n distinct symbols, and an edge exists between two nodes v and w if and only if the corresponding permutation of the node w can be obtained from that of v by interchanging the positions of any two arbitrary symbols from {1, 2, .., n} in v. A transposition graph Tn can be defined by Eq. (3) with n distinct symbols = {1, 2, .., n}, and a permutation of , P = p1p2...pn, pi . The transposition graph Tn is a regular node symmetric graph with n(n-1)/2 degree, because an edge exists between the permutation that consists of n symbols and that in which two arbitrary different symbols are interchanged. It has maximum fault tolerance with a diameter of n–1 and a fault diameter of n. It also includes Hamiltonian cycles.

∈

V(Tn)={(p1p2...pi...pn)│pi , i≠j, pi≠pj} E(Tn)={(p1p2...pi...pj...pn)(p1p2...pj...pi...pn)│(p1p2...pi...pj...pn)V(Tn), 1≤i,j≤n, i≠j}. (3)

3 Embedding Analysis The embedding of one graph G into another graph H is a mapping mechanism for examining whether graph G is included in the structure of graph H, and how they are interrelated. This can be interpreted as simulating one interconnection topology using another. The embedding of graph G into a graph H is defined as a function f = (ø, ρ) where ø maps the set of vertices in G, V(G) one-to-one into the set of vertices in H, V(H), and ρ corresponds to each edge (v, w) in G to a path in H that connects nodes ø(v) and ø(w). Parameters for evaluating the efficiency of an embedding method include dilation, congestion, and expansion. The dilation of edge e in G is the length of the path ρ(e) in H, and the dilation of embedding f is the maximum value of all dilations in G. The congestion of edge e' in H is the number of ρ(e) included in e', and the congestion of embedding f is the maximum number of all edge congestions in H. The expansion of embedding f is the ratio of the number of vertices in H to the number in G [6].

138

H. Lee et al. 1

Embedding edge e (3, 6) into G2

2

with dilation 2

3 e

4

5

6

Graph G1

7

1

2

5

6

3

4

8

7

congestion 2

Graph G2

Fig. 2. Mapping example of G1 into G2

For instance, let each node (vertex) in the set V(G1) be mapped onto the corresponding same node number in the set V(G2) in Fig. 4. Then, edge e (3, 6) in G1 can be mapped onto edges (3, 1) and (1, 6) or edges (3, 7) and (7, 6) in G2 (i.e., to the path from 3 to 6 in G2). Let us assume that edge e (3, 6) in G1 is mapped onto edges (3, 1) and (1, 6) in G2. Here, the dilation of this embedding is 2 because the length of the path ρ(e) in G2 is 2. As well, we can see that the congestion is 2 because edge e (1, 3) in G2 is routed by two edges (1, 3) and (3, 6) in G1, and the expansion is 8/7 [6]. When we assume that R(=r1r2...ri...rj...rn) is a node in an arbitrary graph and V is adjacent from R via dimensional edge J in the graph, it is represented as V = J(R). And when we assume that node V is reached from node R by applying dimensional edges J and K in sequence, we denote the edge sequence as <J, K> and V = K(J(R)). Sequentially applying edge sequence <J, K> to node R means that at the first time unit, the permutation of node J(R) is generated from node R via dimensional edge J, and at the second time unit, the permutation of K(J(R)) is created from node J(R) via dimensional edge K (i.e., V = K(J(R))). The basic principles of embedding applied in this study are as follows. Node mapping to bubble-sort, transposition, and macro-star graphs is based on one-to-one mapping with identical node numbers. When mapping two adjacent nodes (U, V) of a source graph to a target graph, the dimensional edge sequence is defined using the edge definition of the target graph. This sequence is formulated with dimensional edges of the target graph used for the shortest path from ø(U) to ø(V). The dilation of embedding is represented as the number of dimensional edges required for the shortest path. Theorem 1. A bubble-sort graph Bn can be embedded into a transposition graph Tn with dilation 1 and expansion 1. Proof. In the bubble-sort graph Bn, if the permutation of a node B is b1b2b3...bi1bibi+1...bn, then the permutation of node B' adjacent from node B is b1b2...bi+1bibi-1...bn. The edge that connects nodes B and B' is called the i-dimensional edge (2 ≤ i ≤ n). There exist (n-1) edges of dimension i in Bn. When mapping nodes B and B'(=b1b2...bi+1bibi-1...bn) in Bn onto nodes T(=t1t2...titi+1...tn) and T'(=t1t2...ti+1ti...tn), respectively, in Tn, we analyze the dilation of this mapping by referring to the length of the shortest path routing from node T to node T' in Tn. In the transposition graph Tn, there exists an edge between nodes V and W if the corresponding permutation to the node W can be obtained from that of V by interchanging the positions of any two arbitrary symbols from {1, 2, .., n} in V. Here, we can see that the nodes T(t1t2...titi+1...tn) and T'(t1t2...ti+1ti...tn) in Tn are adjacent to each other by the edge

Embedding Algorithms for Bubble-Sort, Macro-star, and Transposition Graphs

139

definition of transposition graph Tn. Therefore, a bubble-sort graph Bn can be embedded into a transposition graph Tn with dilation 1 and expansion 1. Corollary 2. A bubble-sort graph Bn is a sub-graph of a transposition graph Tn. Theorem 3. The dilation cost of embedding a transposition graph Tn into a bubblesort graph Bn is O(n). Proof. The transposition graph Tn and bubble-sort graph Bn are both node-symmetric. When mapping the transposition graph Tn and the bubble-sort graph Bn, node T(=t1t2t3...ti...tj...tn) in Tn maps onto node B(=b1b2b3...bi...bj...bn) in Bn, and each node T' of n(n–1)/2 nodes, which are adjacent to node T via edge T(i, j), maps onto B', whose address is the same as that of each corresponding node T'(i < j). We prove Theorem 3 using the mapping case of two nodes T(=t1t2t3...ti...tj...tn) and T'(=tnt2t3...ti...tj...t1), adjacent to each other via edge T(1, n), onto a bubble-sort graph. The node T', which is adjacent to node T via edge T(1, n), is the permutation tnt2t3...ti...tj...t1 in which the first symbol and the nth symbol are interchanged with each other from the permutation of node T. Let us map nodes T(=t1t2t3...ti...tj...tn) and T'(=tnt2t3...ti...tj...t1) in Tn onto nodes B(=b1b2b3...bi...bj...bn) and B'(=bnb2b3...bi...bj...b1) in Bn. Here, nodes T and T' are adjacent to each other, but nodes B and B' in Bn are not. Thus, we analyze dilation based on the length of the shortest path routing from B and B' in Bn. In the bubble-sort graph Bn, the dimensional edge sequence required for the shortest path routing from node B to node B' is <1,2,3,...,n-1,n-2,n-3,...,3,2,1>. First, the first symbol b1 of node B can be moved to the last position (nth position) using the dimensional edge sequence <1,2,3,...,n-1>, since only an edge exists between two nodes in which two adjacent symbols are interchanged with each other. That is, by sequentially applying the dimensional edge sequence <1,2,3,...,n-1> to node B(=b1b2b3...bi...bj...bn), we can obtain the permutation b2b3...bi...bj...bnb1. Next, the permutation bnb2b3...bi...bj...bn-1b1 (i.e., the permutation of destination node B') is obtained by orderly applying the edge sequence to the node of the permutation b2b3...bi...bj...bnb1. Here, the number of dimensional edges applied for routing from B to B' in Bn is equal to 2n-3. Therefore, we can say that the dilation cost for this embedding process is O(n). Theorem 4. A macro-star graph MS(2, n) can be embedded into a transposition graph T2n+1 with dilation n. Proof. We prove Theorem 4 by dividing it into two edges, Ti and Sj, which connect two arbitrary nodes in the macro-star graph MS(2, n). Case 1. Edge Ti, 2 ≤ i ≤ n+1 In the macro-star graph MS(2, n), the node adjacent to node U(=u1u2u3...ui...un+i...u2n+1) via edge Ti is Ti(U)(=uiu2u3...u1......un+i...u2n+1) and it is denoted as U'. When we map nodes U(=u1u2u3...ui......un+i...u2n+1) and U'(=uiu2u3...u1......un+i...u2n+1) in MS(2, n) onto nodes T(=t1t2t3...ti...tn+i...t2n+1) and T'(=tit2t3...t1...tn+i...t2n+1) in T2n+1, we can see that the nodes T and T' in T2n+1 are adjacent to each other through edge T(1, i) according to the edge definition of the transposition graph. Hence, a macro-star graph MS(2, n) can be embedded into a transposition graph T2n+1 with dilation 1.

140

H. Lee et al.

Case 2. Edge Sj, j = 2 Edge Sj, which is incident on node U(=u1u2u3...ui......un+i...u2n+1), can exist only where j = 2, because the macro-star graph MS(2, n) consists of two modules. In MS(2, n), the node adjacent from U(u1u2u3...ui......un+i...u2n+1) via edge S2 is S2(U)(=u1un+1un+2un+3...un+i...u2n+1u2u3u4...ui...un+1), and it is denoted as U'. When mapping nodes U and U'(=u1un+1un+2un+3...un+i...u2n+1u2u3u4...ui...un+1) in MS(2, n) onto nodes T(=t1t2t3...ti...tn+i...t2n+1) and T'(=t1tn+1tn+2tn+3...tn+i...t2n+1t2t3t4...ti...tn+1) in T2n+1, it can be seen that the nodes T and T' in T2n+1 are not adjacent to each other by the edge definition of the transposition graph. Thus, we analyze the dilation of this mapping using the number of edges used for the shortest path routing from node T to node T' in T2n+1. The dimensional edge sequence required for routing from node T(t1t2t3...ti...tn+i...t2n+1) to node T'(tit2t3...t1...tn+i...t2n+1) in T2n+1 is . In other words, edge T(i, n+i), which interchanges the symbols ti and tn+i, which are in the same position of the first and second modules in the permutation of node T(t1t2t3...ti...tn+i...t2n+1), is used n times, so the permutation identical to node T'(tit2t3...t1...tn+i...t2n+1) is generated. Therefore, its dilation is n. When we map a macro-star graph MS(2, n) onto a transposition graph T2n+1, the worst dilation of this embedding is n, but most edges are mapped by dilation 1. Accordingly, it may be advisable to prove that the average dilation reaches to the smallest constant. Theorem 5. A transposition graph T2n+1 can be embedded into a macro-star graph MS(2, n) with dilation 5. Proof. In this embedding, we map node T(t1t2t3...ti...tj...t2n+1) in the transposition graph T2n+1 onto node U(u1u2u3...ui...uj...u2n+1) in the macro-star graph MS(2, n), and node T' onto node U', which has the same permutation with T' among the nodes in MS(2,n). The permutation of the node adjacent to node T(t1t2t3...ti...tj...t2n+1) via edge T(i, j) in T2n+1 is T'=t1t2t3...tj...ti...t2n+1. Here, nodes U(u1u2u3...ui...un+i...u2n+1) and U'(u1u2u3...uj...ui...u2n+1) in graph MS(2, n) are not adjacent to each other, thus we analyze dilation using the number of edges used for the shortest path routing from node U to node U' in MS(2, n). We prove Theorem 6 by dividing it into three cases depending on the values of i and j in T(i, j). Case 1. i, j ≤ n–1, i < j In the edge T(i, j), which connects nodes T(t1t2t3...ti...tj...t2n+1) and T' in the transposition graph T2n+1, values of i and j smaller than (n+2) mean that the two symbols can be interchanged only from the first symbol to (n+1)th symbols. The occurrence of the interchange only from the first symbol to the (n+1)th symbol in node U(u1u2u3... ui... uj... u2n+1) of MS(2,n), in which node T(t1t2t3... ti... tj... t2n+1) of T2n+1 is mapped, means that a symbol interchange occurs among the symbols that consist of the first cluster of node U. Therefore, the edge sequence required for the shortest path routing from node U to node U' is <Ti, Tj, Ti>, because the permutation of node U' is u1u2u3...uj...ui...un+2un+3...u2n+1, and nodes U and U' are not adjacent to each other. The routing process from U to U' using this edge sequence <Ti, Tj, Ti> is as follows. First, node Ti(U)(=uiu2u3...u1...uj...un+2un+3...u2n+1) is reached from node U(u1u2u3...ui...uj...un+2un+3...u2n+1) via edge Ti, which interchanges the first symbol with the ith symbol ui in node U; that is, the node Ti(U) is adjacent to node U through edge Ti. We then get to node TjTi(U)(=uju2u3...u1...ui...un+2un+3...u2n+1) from node Ti(U)

Embedding Algorithms for Bubble-Sort, Macro-star, and Transposition Graphs

141

via edge Tj, which interchanges the ith and jth symbols in node Ti(U). Next, we reach node TiTjTi(U)(=u1u2u3...uj...ui...un+2un+3...u2n+1) from TjTi(U) via edge Ti, which interchanges symbols uj and u1 in node TjTi(U). Now, we can see that the permutation of node TiTjTi(U), which is obtained by sequentially applying the edge sequence <Ti, Tj, Ti> to node U, is the same as the permutation of node U'. Therefore, two nodes T(t1t2t3...ti...tj...t2n+1) and T' adjacent via edge T(i, j) in the transposition graph T2n+1 can be embedded into a macro-star graph MS(2, n) with dilation 3. Case 2. i ≤ n+1, j ≥ n+2 In the edge T(i, j) which connects nodes T(t1t2t3...ti...tj...t2n+1) and T' in T2n+1, i ≤ n+1 and j ≥ n+2 mean that two symbols are interchanged with each other, and of these two, the one based on the (n+1)th position is positioned before it, and the other locates next to it. The occurrence of the interchange based on the (n+1)th position of a symbol in node U(u1u2u3...ui...uj...u2n+1) of MS(2,n), in which node T(t1t2t3...ti...tj...t2n+1) of T2n+1 is mapped, means that an interchange occurs between one symbol of the first cluster and one symbol of the second cluster in node U. Since the permutation of node U' is u1u2u3...ui...un+2un+3...uj...u2n+1, and nodes U and U' are not adjacent to each other, the edge sequence required for the shortest path routing from node U to node U' is <Ti, S2, Tj, S2, Ti>. The routing process from U(u1u2u3...ui...un+1un+2un+3...uj...u2n+1) to U'(u1u2u3...ui...un+1un+2un+3...uj...u2n+1) using this edge sequence <Ti, S2, Tj, S2, Ti> is as follows. First, node Ti(U)(=uiu2u3...u1...un+1un+2un+3...uj...u2n+1) is adjacent to node U(u1u2u3...ui...un+1un+2un+3...uj...u2n+1) via edge Ti, which interchanges the first with the ith symbol ui in node U; then node S2Ti(U)(=uiun+2un+3...uj...u2n+1u2u3...u1...un+1) is adjacent to node Ti(U) through edge S2, which swaps the first and the jth cluster uj in Ti(U). Next, node TjS2Ti(U)(=ujun+2un+3...ui...u2n+1u2u3...u1...un+1) is adjacent to node S2Ti(U) through edge Tj, which interchanges the first symbol ui with the symbol uj in the first cluster in S2Ti(U). After that, node S2TjS2Ti(U) (=uju2u3...u1...un+1un+2un+3...ui...u2n+1) is reached from TjS2Ti(U) via edge S2, which exchanges the first cluster in which the symbol ui exists with the second cluster in node TjS2Ti(U). Then, we get to node TiS2TjS2Ti(U) (=u1u2u3...uj...un+1un+2un+3...ui...u2n+1) from node S2TjS2Ti(U) via edge Ti, which interchanges the symbols uj and u1 in node S2TjS2Ti(U). Because the permutation of node TiS2TjS2Ti(U), which is obtained by sequentially applying the edge sequence <Ti, S2, Tj, S2, Ti> to node U, is identical to the permutation of node U', we can see that two nodes T and T' adjacent via edge T(i, j) in the transposition graph T2n+1 can be embedded into a macro-star graph MS(2, n) with dilation 5. Case 3. i, j ≥ n+2 In edge T(i, j), which connects nodes T(t1t2t3...ti...tj...t2n+1) and T' of the transposition graph T2n+1, because the values i and j are larger than (n+1), the two symbols can be interchanged with each other only from the (n+2)th symbol to the (2n+1)th symbols. The interchanges from the (n+2)th symbol to the (2n+1)th symbols in node U(u1u2u3... ui... uj... u2n+1) of MS(2,n), in which node T(t1t2t3... ti... tj... t2n+1) of T2n+1 is mapped, represent the interchanges that occur among the symbols that consist of the second cluster of U. Because the permutation of node U' is u1u2u3...un+1un+2...uj...ui...u2n+1, and nodes U(u1u2u3...un+1un+2...ui...uj...u2n+1) and U' are not adjacent to each other, the edge sequence required for the shortest path routing from node U to node U' is <S2, Ti, Tj, Ti, S2>. The routing process from node U and node U' is as follows.

142

H. Lee et al.

We first use edge generator S2 to exchange the first and second clusters in which symbols ui and uj exist, because the interchange between the symbols in node U occurs only between the symbols positioned in the first column and the symbols positioned in the first cluster. Thus, node S2(U)(=u1un+2...ui...uj...u2n+1u2u3...un+1) is adjacent to node U(u1u2u3...un+1un+2...ui...uj...u2n+1) via edge S2, and node TiS2(U)(=uiun+2...u1...uj...u2n+1u2u3...un+1) is adjacent to node S2(U) via edge Ti, which places symbol ui of S2(U) at the first position. Following this, node TjTiS2(U)(=ujun+2...u1...ui...u2n+1u2u3...un+1) is reached from node TiS2(U) via edge Tj, which interchanges the first symbol ui with symbol uj in the first cluster in S2Ti(U). Next, node TiTjTiS2(U)(=u1un+2...uj...ui...u2n+1u2u3...un+1) is connected to TjTiS2(U) by edge Ti, which interchanges the first symbol uj with u1 in node TjTiS2(U). Then, we use edge S2 to swap the first and second clusters in node TiTjTiS2(U), and reach node S2TiTjTiS2(U)(=u1u2u3...un+1un+2...uj...ui...u2n+1) from node TiTjTiS2(U) through edge S2. Here, we can seen that the permutation of node S2TiTjTiS2(U), which is obtained by sequentially applying the edge sequence <S2, Ti, Tj, Ti, S2> to node U, is identical to the permutation of node U'. Therefore, the edge T(i, j) that connects nodes T(t1t2t3...ti...tj...t2n+1) and T' in the transposition graph T2n+1 can be embedded into a macro-star graph MS(2, n) with dilation 5. Consequently, all nodes in a transposition graph T2n+1 can be mapped one-to-one onto a macro-star graph MS(2, n), and an edge in T2n+1 can be embedded into MS(2,n) with dilation of 5 or under.

4 Conclusion The star graph, a well-known topology of MIMD multi-computer systems with distributed memory, is a small diameter, node-symmetric, hierarchical, and maximum fault-tolerant interconnection network. The transposition graph not only improves the fault tolerance of multi-computer systems, but also shares the advantages of a star graph. It also contains a star graph as its sub-graph. The bubble-sort and macro-star graphs also share the advantages of the star graph. In this paper, we proposed methods for embedding bubble-sort, transposition, and macro-star graphs into one another, which have been introduced as variations of the star graph. These graphs have the same number of nodes and also the same number of symbols in a node. The proposed embedding methods are based on one-to-one mapping of two arbitrary nodes U and U' of a source graph G onto two nodes in a target graph G'. We assumed that two mapped nodes in G' are connected with a minimum of edges based on the edge definition of target graph G'. Then, we analyzed dilation by the number of edges used for the shortest path routing between two mapped nodes in G'. Embedding analysis using the edge definition of graphs is possible because the bubble-sort, transposition, and macro-star graphs are all node-symmetric. The results of this study indicate that bubble-sort graph Bn can be embedded into transposition Tn with dilation 1 and expansion 1. In addition, macro-star graph MS(2, n) can be embedded into transposition graph T2n+1 with dilation n, but with an average dilation of 2 or less. Acknowledgement. This research was supported by Basic Science research program through the National research Foundation of KOREA(NRF) funded by the Ministry of Education, Science and Technology(2009-0086676).

Embedding Algorithms for Bubble-Sort, Macro-star, and Transposition Graphs

143

References 1. Feng Athas, W.C., Seitz, C.L.: Multicomputers: Message-Passing Concurrent Computers. IEEE Computer, 9–24 (August 1988) 2. Saad, Y., Schultz, M.H.: Topological Properties of Hypercubes. IEEE Trans. Comput. 37(7), 867–872 (1988) 3. Azevedo, M.M., Bagherzaeh, N., Latifi, S.: Low Expansion Packing and Embeddings of Hypercubes into Star Graphs: A Performance-Oriented Approach. IEEE Parallel and Distributed Systems 9(3), 261–274 (1998) 4. Ghafoor, A., Bashkow, T.R.: A Study of Odd Graphs as Fault-Tolerant Interconnection Networks. IEEE Trans. Computers 40(2), 225–232 (1991) 5. Kim, J.S., Seo, J.H., Sim, H., Lee, H.O.: Embedding Algorithm Between the Macro-star Graph and the Matrix-star Graph. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Application. PDPTA, pp. 418–426 (2006) 6. Kim, M.H., Kim, D.W., Lee, H.O.: Embedding Algorithms for Star, Bubble-Sort, RotatorFaber-Moore, and Pancake Graphs. In: Hsu, C.-H., Yang, L.T., Park, J.H., Yeo, S.-S. (eds.) ICA3PP 2010. LNCS, vol. 6082, pp. 348–357. Springer, Heidelberg (2010) 7. Seo, J.H., Lee, H.O., Jang, M.S.: Node Mapping Algorithm Between Macro-star Networks and Transposition Networks. In: Proceedings of the Korean Information Science Society Conference 2008, vol. 35(1B), pp. 584–587 (2008) 8. Akers, S.B., Krishnamurthy, B.: A Group-Theoretic Model for Symmertric Interconnection Network. IEEE Trans. Comput. 38(4), 555–565 (1989) 9. Yang, X., Tang, Y.Y., Cao, J.: Embedding Torus in Hexagonal Honeycomb Torus. IET Computers & Digital Techniques 2, 86–93 (2008) 10. Park, J.H., Lim, H.S., Kim, H.C.: Embedding Starlike Trees into Hypercube-Like Interconnection Networks. In: Min, G., Di Martino, B., Yang, L.T., Guo, M., Rünger, G. (eds.) ISPA Workshops 2006. LNCS, vol. 4331, pp. 301–310. Springer, Heidelberg (2006) 11. Berthome, P., Ferreira, A., Perennes, S.: Optimal Information Dissemination in Star and Pancake Networks. IEEE Trans. on Parallel and Distributed Syst. 7(12), 1292–1300 (1996) 12. Chou, Z.T., Hsu, C.C., Sheu, J.P.: Bubblesort Star graphs: A New Interconnection Network. In: 9th International Parallel Processing Symposium, pp. 41–48 (1996) 13. Latifi, S., Srimani, P.K.: Transposition Networks as a Class of Fault-Tolerant Robust Networks. IEEE. Trans. Comput. 45(2), 230–238 (1996) 14. Yeh, C.H., Varvarigos, E.A.: Macro-Star Networks: Efficient Low-Degree Alternatives to Star Graphs. IEEE Trans. Parallel and Distributed Systems 9(10), 987–1003 (1998) 15. Corbett, P.F.: Rotator Graphs: An Efficient Topology for Point-to-Point Multiprocessor Networks. IEEE Trans. Parallel Distributed Syst. 3(5), 622–626 (1992) 16. Faber, V., Moore, J. W.: High-degree, Low-diameter Interconnection Networks with Vertex Symmetry: The Directed Case. LA-UR-88-1051, LosAlmos National Lab., Los Almos (1988)

An Eﬃcient Simulation Algorithm for Cache of Random Replacement Policy Shuchang Zhou Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, 100190, Beijing, China [email protected]

Abstract. Cache is employed to exploit the phenomena of locality in many modern computer systems. One way of evaluating the impact of cache is to run a simulator on traces collected from realistic work load. However, for an important category of cache, namely those of random replacement policy, each round of the na¨ıve simulation can only give one out of many possible results, therefore requiring many rounds of simulation to capture the cache behavior, like determining the hit probability of a particular cache reference. In this paper, we present an algorithm that eﬃciently approximates the hit probability in linear time with moderate space in a single round. Our algorithm is applicable to realistic processor cache parameters where the associativity is typically low, and extends to cache of large associativity. Experiments show that in one round, our algorithm collects information that would previously require up to dozens of rounds of simulation. Keywords: Simulation, Cache memories, Stochastic approximation.

1

Introduction

Modern computer systems depend heavily on the eﬃcacy of cache in exploiting locality for improving performance. There are hardware cache built into processors and disk drivers, as well as software cache used in operating system kernels, proxy and ﬁle servers. One way of evaluating cache’s impact is to run a cache simulator on traces collected from realistic work load. The simulator will mimic the conﬁguration, topology and replacement policy of the simulated cache. However, for an important category of cache, namely those of random replacement policy, as the replacement policy randomly determines one among multiple candidates for eviction, the na¨ıve simulation only constitutes a Monte Carlo simulation, therefore can only give one out of many possible results. In particular, in a single round, na¨ıve simulation cannot give the hit probability of each cache reference, which is of particular interest in program analysis[3][4]. If we maintain n copies of possible cache states and simulate access sequences on these states simultaneously, then the time and space requirement of the simulation will be equal to running n copies of na¨ıve simulation in parallel, with no C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 144–154, 2010. c IFIP International Federation for Information Processing 2010

An Eﬃcient Simulation Algorithm

145

gain in eﬃciency. In contrast, we present an algorithm that directly approximates the hit probability. Experiments show that in one round, our algorithm collects information that would previously require up to dozens of rounds of simulation. The rest of paper is organized as follows. Section 2 gives some background for our algorithm. Section 3 describes our algorithm. Section 4 evaluates our algorithm by simulating realistic work load. Section 5 overviews related work.

2

Background

Cache maps addresses to values. In a fully-associative1 cache under random replacement policy, when the cache receives a request for an address A, the cache will look up all its content to see if the cache line containing A is already in the cache. If the matching succeeds, the request causes a hit event, and the corresponding value in cache is returned to the request. Otherwise, the request causes a miss event and the cache will fall back to some backup mechanism to satisfy the request. In case of miss event, a random place from the cache is picked, regardless of whether it originally contains valid data or not.2 The original content at the place will be evicted to allow storing the address A and its corresponding value just retrieved through backup mechanism. Two nearest references to the same cache line forms a reuse window, with references to other cache lines in between. Cache of random replacement policy3 are found in Translation Look-aside Buﬀers[6], and processor cache in ARM [13] processor family. The Monte Carlo simulation of random replacement policy faithfully models the hardware by using (pseudo) random numbers to select the target for eviction. However, the selection is path dependent. For example, assume a fully-associative cache of size two and let < a, e > denote that cache contains a and e as content, with diﬀerent letters representing diﬀerent cache lines, and assume an access sequence of d, a. As d is not in cache, it causes a miss event, leaving the cache in either < a, d > or < d, e >. Now the next access a will exhibit diﬀerent behavior under the two possible cache states: for < a, d > it will be a hit event, and for < d, e > it will be a miss event. An observer that simulates sequence d, a multiple times will either see miss, miss or miss, hit, with equal likelihood. However, such a phenomenon can not be observed in a single round of simulation. To make the matter worse, the number of possible cache states keeps growing after each miss event, and in consequence requires more rounds of simulation to 1

2

3

Hereafter we will limit our attention to fully-associative cache. In a set-associative cache, as the mapping from the address to the cache set is ﬁxed, the selection of eviction targets in diﬀerent cache sets are statistically independent. Hence a setS associative cache of size S and associativity M is equivalent to M fully-associative cache operating in parallel, each of size M . Even if the replacement algorithm takes care to not evict valid data when there are free slots, as the cache is soon ﬁlled up with valid data, there will be no diﬀerence in practice. Sometimes referred to as pseudo random replacement policy due to diﬃculty, if not impossibility, of obtaining true randomness.

146

S. Zhou

capture the cache behavior. As mentioned above, maintaining n copies of possible cache states and simulating access sequences on these states simultaneously will not improve eﬃciency compared to running n copies of simulation in parallel, or running n rounds of simulation in sequence. In theory, we can represent the state of a cache under random replacement policy as a probability distribution among all possible states, and each possible address as a transition probability matrix. The simulation of cache is then reduced to multiplying the original probability distribution Z with a chain of matrices. However, the probability distribution has M size, and each transition i=0 i Z ) non-zero elements, where Z is the number of distinct matrix has O(M M i=0 i elements in trace, rendering this simple approach infeasible in general.4 Instead, we present an algorithm that implicitly calculates the probability distribution.

3

Algorithm

3.1

The Foundation

Assume the trace of cache lines of an access sequence with indices in logical time to be: a0 , a1 , .., aN . We represent the miss event of time i by an indicator random variable[1] Xi , such that Xi = 1 when a miss event happens at i and Xi = 0 otherwise, forming another sequence: X0 , X1 , .., XN . The indicator for hit event of time i is just 1 − Xi . The hit probability of ai is 1 − E(Xi ), where E(x) is x’s expectation. The expected count of hit event of a sequence S is just i∈S (1 − E(Xi )). Let the cache size be M . We can determine Xi in the following way. At time i, we inspect the subsequence {aj |j < i, aj = ai }. If the subsequence is empty, then ai will not be present in cache, hence Xi is 0. Otherwise, there is a reuse window and we assume the largest index of the subsequence, i.e. the start of the reuse window, is k. We let the number of misses since k be Zi . i−1 l=k+1 Xl , if there is a reuse window Zi = (1) ∞, otherwise. Due to linearity of expectation, we get i−1 l=k+1 E(Xl ), if there is a reuse window E(Zi ) = ∞, otherwise.

(2)

Proposition 1 E(Xi ) = 1 − E((1 − 4

n k

1 Zi ) ). M

is the number of k-element subsets of an n-element set.

(3)

An Eﬃcient Simulation Algorithm

147

Proof. We observe that ak is deﬁnitely in cache after time k. As every miss 1 between k and now will evict ak with probability M , the probability of ak still being present in cache, which is also the probability of observing a hit event, is 1 ) to the power of number of misses since (excluding) k. Then with a given (1 − M Zi , we have: 1 Zi E(Xi |Zi ) = 1 − (1 − ) . (4) M Taking the expectation over Zi gives equation 3. However, as Xi are related to the preceding elements in {Xi }, hence in general E(aXi +Xj ) = E(aXi )E(aXj ). Therefore we cannot calculate right hand side of equation 3 easily. However, we note an approximation: Proposition 2 When M is large, E(Xi ) ≈ 1 − (1 −

1 E(Zi ) ) . M

(5)

Proof. Let E(Xi ) = 1 − E((1 − When Zi = ∞, δ = 0 as both (1 − When Zi < ∞ δ =

1 E(Zi ) M)

1 E(Zi ) M) 1 E(Zi ) ln(1− M )

= 0 and E((1 −

1 Zi M) )

(6) = 0.

1 Zi M) ) 1 Zi ln(1− M )

= (1 −

− E((1 −

=e

− E(e

(7)

)

1 1 ))n ))n (E(Zi ) ln(1− M (Zi ln(1− M − E( ∞ ) n=0 n=0 n! n! n 1 ∞ )|) (| ln(1− M (E(Zin ) − E(Zi )n ). − n=0 (−1)n n!

∞

=

1 Zi 1 E(Zi ) ) ) = 1 − (1 − ) + δ. M M

(8) (9) (10)

∞ We note δ is the sum of an inﬁnite converging alternating series n=0 (−1)n an 0 0 where it can be observed that a0 = a1 = 0 as E(Zi ) = E(Zi ) and E(Zi1 ) = (| ln(1−

1

M E(Zi )1 . As δ is an alternating series with coeﬃcients n! rapidly , we may expect δ ≈ 0 and then obtain approximation 5.

)|)n

decreasing

Now we can can use equation 2 and 5 to approximate 1 − E(Xi ), the hit probability of each cache reference. However, we are unable to give a rigorous bound of δ and have to resort to empirical evaluation in section 4. 3.2

The Plain Algorithm

The key observation is that we can use a map m and an oﬀset b to eﬃciently calculate E(Zi ) from preceding E(Xi ). When we are at i, we want m[x] + b to equal to the number of misses in the reuse window, i.e. since last occurrence of x to time i. By deﬁnition, E(Zi ) = m[ai ] + b. If x is not seen before time i, we let m map x to ∞.

148

S. Zhou

We see m and b can be constructed and maintained in the following way. – Initialization: ∀x, m[x] ← ∞, b ← 0. – Maintenance: At time i, b ← b + E(Xi ), m[ai ] ← −b. We can use a hash table h to implement m, and let the absence of x from h to indicate m[x] = ∞. As maintenance of h can be done in O(1), we have obtained an algorithm that approximates {E(Xi )} with time complexity O(N ). 3.3

The -Approximation

The main problem with the algorithm presented above is that the hash table size will grow up to the number of distinct elements in trace. We observe from equation 5 that within an absolute error of the hit probability 1 − E(Xi ), we can stop the backward summation of E(Xi ) at l when (1 − That is when

i≥l

1 i≥l E(Xi ) ) ≤ . M

E(Xi ) ≥

log 1 . log(1 − M )

(11)

(12)

log Let K be log(1− 1 . Intuitively, we are ignoring reuses that are under probM) ability, which means we can prune h of an element x if h[x] ≥ K − b under -approximation of 1 − E(Xi ). In light of this observation, we can improve the space complexity using slidingwindow technique. We split h into two hash tables h and h , and use them in round-robin fashion. Both tables are used for query. Putting an address to one table will remove its entry from another table, if there should be one. We use a counter c to keep track of when to swap h and h , and record the number of swaps with d.

– Initialization: c ← 0, d ← 0, h and h points to two empty hash tables. – Maintenance: At time i, c ← c + E(Xi ), b ← b + E(Xi ), h [ai ] ← −b. If c > K then c ← c − K, clear h , swap h and h , and d ← d + 1; otherwise continue the loop. Proposition 3. The -approximation is correct and works within linear time log 1 and O( log(1− 1 ) ≈ O(M ln( )) space. ) M

Proof. Correctness: We observe that c = b − dK. When x is updated, it holds that h [x] + b = 0. As b is monotonically increasing, it holds that 0 ≤ h [x] + b = h [x] + c + dK. As c ≤ K, we have h [x] ≥ −c − dK ≥ −(d + 1)K. h contains

An Eﬃcient Simulation Algorithm

149

elements of step d− 1, therefore h [x] ≥ −dK. At the time of clearing h , c > K, all values x in h satisfy h [x] ≥ −dK = K − (dK + K) > K − (dK + c) = K − b

(13)

and can be safely pruned. Complexity: Let s be the sum of size of h and h , we next show that s ≤ 2K. At time i, if ai is found in h or h , then s will not increase, but b may be incremented; if ai is not found in either table, both s and b are incremented by 1. Thus increment of s is bounded by the increment of b when c goes from 0 to K. As h and h are cleared every two steps, s ≤ 2K. In this way we achieve log space complexity of O(K) = O( log(1− 1 ). As the O(K) cost of clearing tables M) is amortized among at least K cache references, the time complexity remains O(N ).

4

Empirical Evaluation

We implement two algorithms: the plain and the -approximation variation of our algorithm presented above, and a na¨ıve simulation algorithm that directly simulates current cache state. We perform our experiments on traces collected by HMTT, a platform independent full-system memory trace monitoring system[5]. The traces are collected from running LINPACK[10] and CPU2000[11] benchmark. Both our algorithm and the average of many rounds of na¨ıve simulation give an approximation of hit probability of each reference. In ﬁgures 1, 2, 3, 4, 5, we compare the inferred probability from average of 5, 50, and 500 rounds of the na¨ıve simulation, and the plain and -approximation variation of our algorithm. We use trace fragments of length 106 and use the average of 500 rounds of the na¨ıve simulation as reference, which is depicted in the ﬁgures as squaredotted lines with absolute value of hit probability as x-value and the distribution as y-value. All the other lines depict the distribution of absolute error of hit probability from the reference. A curve decreasing faster towards right indicates better approximation precision. It can be observed that both the plain and approximation variation of our algorithm provide approximation of precision comparable to what is obtained from average of 50 rounds of na¨ıve simulation, and consistently outperforms average of 5 rounds. Our algorithm is applicable to realistic processor cache parameters, for example when M = 2, and extends to cases of large associativity. As indicated by the proof of 5, larger M leads to better approximation precision, such that our algorithm outperforms average of 50 rounds when M ≥ 8. We also observe that choosing to be 0.01 or even 0.1 only mildly aﬀects the precision of approximation. Experiments with other M values and benchmarks show similar results and are not shown here for lack of space. M = 1 case is not considered as in this case eviction targets are picked deterministically.

150

S. Zhou

"" $

! "#

Fig. 1. Absolute error of various methods when M=2 for LINPACK

"" $

! "#

Fig. 2. Absolute error of various methods when M=8 for LINPACK

An Eﬃcient Simulation Algorithm

151

"" $

! "#

Fig. 3. Absolute error of various methods when M=64 for LINPACK

"" $

! "#

Fig. 4. Absolute error of various methods when M=4 for SWIM

152

S. Zhou

"" $

! "#

Fig. 5. Absolute error of various methods when M=16 for WUPWISE

5

Related Work

Cache simulation has long been an important tool in studying the impact of cache on program execution[7]. Much of literature is devoted to study of cache under Least Recently Used policy[8]. However, techniques that are developed for studying LRU cache, like the reuse distance technique, do not apply to cache of random replacement policy in general. In [12], random replacement cache is investigated under a simplifying assumption that miss rate should stay stable over a short period of execution. Under this assumption, [12] uses approximation 5 to calculate the gross cache hit probability. In contrast, we exploit approximation 5 without the assumption, and are able to give hit probability of each cache reference. [15] and [16] also make other simplifying assumptions to approximate the gross hit probability. Most of previous studies on cache of random replacement policy use average of a few rounds of na¨ıve simulation as an approximation of the hit ratio of each cache reference. For example, in [9], which studies the impact of replacement policy on instruction cache, average of three rounds of simulation is used.

6

Conclusion

For an important category of cache, namely those of random replacement policy, it would be necessary to perform many rounds of na¨ıve Monte Carlo simulation to approximate the hit probability of each cache reference. To improve the eﬃciency, we devise a simulation algorithm that can approximate the probability in a single round. We start with a plain algorithm and then reduce its

An Eﬃcient Simulation Algorithm

153

space complexity through -approximation. The -approximation variation of the algorithm works in linear time complexity, and has space complexity of log 1 O( log(1− 1 ) ≈ O(M ln( )), where M is the size of cache. Experiments show M) that in one round, our algorithm collects information that would previously require up to dozens of rounds of simulation.

Acknowledgments We thank Ma Chunhui for interesting discussions that lead to this paper. We thank Bao Yungang, the author of HMTT[5], for providing the trace ﬁles and insightful comments. We would also like to thank the anonymous reviewers for their precious remarks which help improve this paper.

References 1. Introduction to algorithms. MIT Press, Cambridge (2001) ¨ 2. Fang, C., Carr, S., Onder, S., Wang, Z.: Reuse-distance-based miss-rate prediction on a per instruction basis. In: Proceedings of the 2004 Workshop on Memory System Performance MSP 2004, Washington, D.C., June 8, pp. 60–68. ACM, New York (2004) 3. Ding, C., Zhong, Y.: Predicting whole-program locality through reuse distance analysis. In: Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, PLDI 2003, San Diego, California, USA, June 9-11, pp. 245–257. ACM, New York (2003) 4. Beyls, K., D’Hollander, E.H.: Reuse Distance-Based Cache Hint Selection. In: Monien, B., Feldmann, R.L. (eds.) Euro-Par 2002. LNCS, vol. 2400, pp. 265–274. Springer, Heidelberg (2002) 5. Bao, Y., Chen, M., Ruan, Y., Liu, L., Fan, J., Yuan, Q., Song, B., Xu, J.: HMTT: a platform independent full-system memory trace monitoring system. In: Proceedings of the 2008 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2008, Annapolis, MD, USA, June 2-6, pp. 229–240. ACM, New York (2008) 6. Sweetman, D.: See MIPS Run, 2nd edn. Morgan Kaufmann Publishers, San Francisco (2006) ISBN 0-12088-421-6 7. Sugumar, R.A., Abraham, S.G.: Multi-conﬁguration simulation algorithms for the evaluation of computer architecture designs. Technical report, University of Michigan (1993) 8. Mattson, R.L., Gecsei, J., Slutz, D., Traiger, I.L.: Evaluation techniques for storage hierarchies. IBM System Journal 9(2), 78–117 (1970) 9. Smith, J.E., Goodman, J.R.: A study of instruction cache organizations and replacement policies. SIGARCH Comput. Archit. News 11(3), pp. 132–137 (1983) 10. http://www.netlib.org/linpack/ 11. http://www.spec.org 12. Berg, E., Hagersten, E.: Fast data-locality proﬁling of native execution. SIGMETRICS Perform. Eval. Rev. 33(1), 169–180 (2005) 13. ARM Cortex-R4 processor manual, http://www.arm.com

154

S. Zhou

14. Guo, F., Solihin, Y.: An analytical model for cache replacement policy performance. In: Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2006/Performance 2006, Saint Malo, France, June 26-30, pp. 228–239. ACM, New York (2006) 15. Chandra, D., Guo, F., Kim, S., Solihin, Y.: Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture. In: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, HPCA, February 12-16, pp. 340–351. IEEE Computer Society, Washington (2005) 16. Suh, G.E., Devadas, S., Rudolph, L.: Analytical cache models with applications to cache partitioning. In: Proceedings of the 15th International Conference on Supercomputing, ICS 2001, Sorrento, Italy, pp. 1–12. ACM, New York (2001) 17. Agarwal, A., Hennessy, J., Horowitz, M.: An analytical cache model. ACM Trans. Comput. 7(2), 184–215 (1989)

DABGPM: A Double Auction Bayesian Game-Based Pricing Model in Cloud Market Shifeng Shang, Jinlei Jiang, Yongwei Wu, Zhenchun Huang, Guangwen Yang, and Weimin Zheng Department of Computer Science and Technology Tsinghua National Laboratory for Information Science and Technology Tsinghua University, Beijing, 100084, China [email protected], {jjlei,wuyw,huangzc,ygw,zwm-dcs}@tsinghua.edu.cn

Abstract. Recently IT giants such as Google, Amazon, Microsoft, and IBM are gearing up to be a part of the Cloud and begin to sell their cloud services. However, the current market trading mechanism is inflexible, and the price is not reasonable enough in some situation. Therefore, we first propose a cloud market framework for people to build a uniform and fully competitive cloud market where users can buy resources from different companies and exchange their idle resources in a more flexible way. Then we define a double auction Bayesian Game-based pricing model (DABGPM) for the suggested cloud market and discuss how to develop an optimal pricing strategy for this model. Our work, we think, makes a good example of more flexible and more reasonable cloud resources trading. Keywords: Cloud Computing; Pricing Model; Double Auction; Cloud Market.

1 Introduction Cloud computing is becoming a computing buzzword both in industry and academia. Academic efforts include Nimbus [1], Aneka[2][3], OpenNebula [4], and Tsinghua Cloud[5]. In industry, more and more vendors are gearing up and begin to sell their cloud services, for example, Amazon’s EC2 and S3 [6], Google’s GAE [7], Microsoft Azure [8], Rackspace[9], GoGrid[10], and VPS.net[11], to name but just a few. These services charge users for the resources consumed on a per-use basis. Currently, most companies adopt a fixed rate pricing strategy, and user can get a great discount through pre-pay method. We argue such a pricing model is not perfect. Firstly, it results in resources waste for applications (e.g., financial analysis) that users only need to run once a month for hours. Secondly, the pre-pay method will potentially make users be locked to certain providers for long with little chance to receive better and cheaper services from other vendors. Thirdly, it even gets expensive to use the fixed rate pricing model in some situation [21]. C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 155–164, 2010. © IFIP International Federation for Information Processing 2010

156

S. Shang et al.

Table 1 shows the services prices of different vendors, where the basic configuration of compute is of 1 GB (=109 bytes) RAM, and 40 GB Disk. We can see that the prices of compute range from $0.06 to 0.12 with a maximum difference as much as $0.06 per hour for the same resource. Obviously, it would be attractive to establish a uniform and fully competitive cloud computing resources trading market where users can run their applications using resources from different companies as well as exchange unused resources. Table 1. The cloud services prices of different vendors Price Type

Amazon

Windows Azure

Google

GoGrid

Rackspace

Compute CPU /hours Storage GB/month

$0.085/linux $0.12/windows

$0.12

$0.10

$0.10

$0.06

$0.15

$0.15

$0.15 first 0.5GB free

$0.15 first 10GB free

$0.15

Data Upload GB

$0.10 Free through June 30, 2010

$0.10

$0.12

$free

$0.08

Data Download GB

$0.17 $0.13 if>10TB

$0.15

$0.10

$0.29

$0.22

The main contributions of this paper are two-fold. Firstly, we propose a cloud market framework for people to build a uniform and fully competitive cloud market where users can trade cloud resources in a more flexible and more effective way. Secondly, we propose the DABGPM model for our cloud market and show how to develop an optimal pricing strategy for this model. The remainder of this paper is organized as follows. Section 2 is the related work. Section 3 gives an introduction to our global cloud market framework. In Section 4, we define the DABGPM model and discuss how to develop pricing strategies based on it. The last section is the conclusion and future work.

2 Related Work There exist many economy-based resource allocation models in grid research [12], mainly including autonomous pricing mechanism [13], pricing based on the combinatorial auction [14], and pricing based on the commodities market model and load prediction [15]. They applied the auction model, the bargaining model, the combinatorial auction model, and the commodity market model, respectively. But these models usually focus on users without taking providers’ price requirements into consideration. Recently, many research projects such as Tycoon [16] and Shirako [17] have been proposed for trading resource allocations. These projects focus on allocating resources based on economy factor rather than a real market. In other words, they do not take into account the full competition between providers.

DABGPM: A Double Auction Bayesian Game-Based Pricing Model

157

With the rapid advancement of QoS (quality of services) and the establishment of security and interoperability standards for cloud computing, more and more companies began offering cloud resources (e.g., computing, storage, platform and software) and the cloud market is becoming more and more competitive. It is the time to establish a uniform and fully competitive cloud market for trading cloud resources.

3 Cloud Market Framework To facilitate cloud resources trading, we propose a uniform and fully competitive cloud market framework as shown in Fig. 1. There are three main types of participants in this framework, namely the Cloud Market Place (CMP), User Agent (UA), and Cloud Resource Provider (CRP).

Fig. 1. The Cloud Market Framework

3.1 Cloud Resource Provider Cloud Resource Provider (CRP) is a datacenter that sells its resources to users and profits from it. The resources include computing power, storage space, network bandwidth, etc. CRP has five components, that is, the Auction Agent, the Admission Control, the Resource Scheduler, the Pricing and Accounting, and the SLA Monitor. • The Auction Agent is responsible for registering resource offers into the Cloud Resource Directory of CMP. It is also auction agent’s duty to generate a bid and submit it to the CMP. • The Admission Control receives the auction result from the auction agent, and decides whether to do the requested tasks or not.

158

S. Shang et al.

• The Resource Scheduler is responsible for allocating resources according to the corresponding request. It is also in charge of balancing the supply and demand of cloud resources, providing feedback in terms of economic incentives for both cloud consumers and providers. • The Pricing and Accounting is responsible for charging users for the resources consumption; • The responsibility of SLA Monitor is to accept the user’s request and guarantee the user-requiring service quality. Cloud providers must consider and meet different QoS requirements of each individual consumer as negotiated in specific SLAs (Service Level Agreements). 3.2 Cloud Market Place CMP is composed of the Cloud Resource Auctioneer, the Pricing Algorithm, the Cloud Resource Directory, the Bank and the Insurance Service. • The Cloud Resource Auctioneer is responsible for collecting the bids of resources made by the UAs and CRPs. Based on the corresponding bidding information, the auctioneer runs the double auction algorithm to determine the winning UAs and CRPs. It is also in charge of informing the UAs and CRPs who have participated in the auction of the auction results. • The Pricing Algorithm is responsible for generating specific allocation results and the corresponding pricing information. The price calculated in this component will be sent to all the UAs and CRPs who participate in the trade. • The Cloud Resource Directory (CRD) allows participants to locate providers or consumers with the right offers. The cloud market place periodically clears bids received from market participants. • The Bank ensures that financial transactions pertaining to agreements between participants are carried out. • The Insurance Service provides insurances against a number of events that would harm both providers and users: (1) the cloud resources cannot be delivered to users; (2) the resource availability is interrupted; (3) users do not pay the resources consumed. 3.3 User Agent User agents help users make the appropriate choice of resources and auction types. There are four components within a user agent, that is, the Resource Planning Service, the Resource Discovery Agent, the Auction Agent, and the SLA Monitor. • The Resource Planning Service (RPS) helps users to determine the most appropriate computing capacity. The service will analyze the existing infrastructure, the deployed applications, and the resource requirements before suggesting a course of action according to users’ deadline, budget, and so on.

DABGPM: A Double Auction Bayesian Game-Based Pricing Model

159

• The SLA Monitor uses the result of RPS to form a service quality request and then sends the request to the auction agent. It is also in charge of monitoring the agreed SLAs to guarantee service quality. • The Resource Discovery Agent is responsible for finding the resources according to users’ SLA requirements. It contacts the CRD module of CMP to obtain the list of resources that meet these requirements. • The Auction Agent is responsible for generating the list of resources returned by the CRD. For each resource available, it makes a bid within the user’s SLA requirement and submits it to the CMP.

4 Pricing Strategies Based on the DABGPM Model As more and more IT Giants pave the way for creating a market infrastructure to trade services, a completely open bilateral cloud market will be established in the near future. Since cloud resource providers aim to maximize profit and consumers aim to minimize spending, how to do pricing is important to both sides. Here we proposed the DABGPM pricing model based on [18, 19, 20] and analyzed pricing strategies with it. 4.1 The DABGPM Model In a fully open and complete competitive global cloud resources exchange market, providers and buyers may not know each others’ valuations of cloud resources; it is a strategic game with incomplete information. The economic profit of the two sides (providers and buyers) is related to the strategies of their own and their opponents. Suppose that the total requirement of all users is Y. The providers and the consumers form a double auction, that is, they can decide whether to exchange Y resources. Let Vu and Vp be the expected resource price of the user and the provider respectively. Obviously, the provider knows the value of Vp, but doesn’t know the value of Vu, whereas the consumer knows Vu, but doesn’t know Vp. According to the above statements, the DABGPM pricing model involves the following elements: • a pair of resource provider and consumer, denoted by N=(s, b). • a set of states, denoted by Ω={(Vp, Vu)}. Indeed, Ω records the possible quoting prices of provider s and consumer b. • a set of signals, denoted by T={t1, t2, …}. Each signal indicates a resource quoting action. • a signal function τi(tÆΩ), which means signals will trigger state change. • a probability function that describes the possibility of each state in Ω. Assume that Vp and Vu are uniformly distributed on [0,1], and the provider and the user bids and offers at the same time. Obviously, we have Ps [0,1] and Pb [0,1]. If Ps≤Pb, two sides will make a deal at price P=kPs+(1-k)Pb; Otherwise, the deal will not happen. The payoff function of resource provider is as follows:

∈

∈

160

S. Shang et al.

　　　　

　　　if　p ≤ p 　　　　　　　(1) 　　　　　　　　　　　 if　p >p

⎧ kp + (1 − k ) p b − V p Us = ⎨ s 0 ⎩

s

b

s

b

The payoff function of user b is as follows:

　　 if　p ≤ p 　　　　　　　(2) 　　　　　　　　　　　 if　p >p

⎧ V − (kps + (1 − k ) p b ) Ub = ⎨ u 0 ⎩

　　　　

s

b

s

b

4.2 The Solution of DABGPM Model According to the double auction process and rules, Vp and Vu is independent and uniformly distributed on [0,1], the selling price of provider s is Ps(Vp); the bidding price of buyer b is Pb(Vu), the strategy p*s ( Vp ) , p*b ( Vu ) is a optimal strategy if and only if the following two conditions are met.

（

）

(1) p*s ( Vp ) is one solution to the optimization question:

　　　　max [k(p + E[(p (V ) | (p ( V ) ≥ p )]) − c]P{p (V ) ≥ p }　　　　　(3) s

Ps

b

u

b

u

s

b

u

s

E[(pb(Vu)|pb(Vu)]≥ps)] is the expected price when the providers’ selling price is lower than consumers’ bidding price. (2) p*b ( Vu ) is one solution to the optimization question:

　　　　max [V -kp -kE[p (V ) | p Pb

u

b

s

p

b

　　　

≥ p s ( Vp )]]P{ pb ≥ ps ( Vp )}

(4)

E[( ps ( Vp ) |p b ≥ ps ( Vp ) ) is the expected price when the providers’ selling price is lower than consumers’ bidding price, p*s ( Vp ) and p*b ( Vu ) are the optimal strategy of provider s and consumer b respectively. Assuming the provider and consumer are all adopting a linear bidding pricing strategy, the solution of DABGPM is below: ⎧⎪ p s ( V p ) = α s + β s V p ( 5) ⎨ ⎪⎩ p b ( Vu ) = α b + β b Vu Vu is defined as of the uniform distribution, hence, p b is also uniformly distributed on the interval [α b , α b + β b ] , therefore

　　　　　　　　

　　　　　　　　　　　　　　　

P {p b ( Vu ) ≥ p s } = P {α b + β b Vu ≥ p s } =

α b + β b − ps βb

　　　

1 αb + βb xdx ∫ β b ps 1 E ⎣⎡ p b ( Vu ) | pb ( Vu ) ≥ ps ⎦⎤ = = ( ps + α b + β b ) P {p b (Vu ) ≥ ps } 2

　

(6)

　　　　　(7)

DABGPM: A Double Auction Bayesian Game-Based Pricing Model

161

　　　　　max [kp + k2-1 ( p + α + β ) - V ] ⋅ α +ββ - p 　　　　　　　　　(8) b

Ps

s

s

b

b

b

s

p

b

Applying its first-order condition, we get: ps =

k 1 Vp ( αs + βs ) + k +1 k +1

(9)

Thus it can be seen, if consumer b adopting a linear strategy, the optimal response of cloud resource provider is also linear. As the same, because Vp is of uniform distribution, hence, p s is also of uniform distribution on the interval [α s , α s + β s ] , therefore: p − αb P {p b ≥ ps (Vp )} = P {p b ≥ α s + β s Vp } = b (10) βs

　　　　　　　　

E ⎡⎣ p s ( V p ) | p b ( Vu ) ≥ p b ≥ p s ( V p ) ⎤⎦ =

1 pb ∫ xdx β s αs

P {p b ≥ p s ( V p )}

=

1 ( pb + αs ) 2

　　　　(11)

Inserting equation (10) and (11) into the effective function of buyer, we can obtain:

　　　　max [V − ((1 − k ) p Pb

u

b

+

p −α k ( pb + α s ))] ⋅ b b 2 βs

　　　　　(12)

Applying its first-order condition, we get:

　　　　　　　p

b

=

k 1 as + Vu 1+ k 1+ k

　　

　

(13)

Synthesizing equation (5), (9) and (13), we can obtain a pricing solution between cloud resource providers and buyers: k 1 ⎧ ⎪ ps (V p ) = 2k + 1 + k + 1 V p ⎪ k ∈ [0,1] ⎨ k2 1 ⎪ p (V ) = + V u ⎪⎩ b u (1 + k )(1 + 2k ) k + 1

　

　　　　　　　　　　(14)

4.3 Pricing Strategy Analysis

The Formula (14) has given an equilibrium pricing strategy between cloud providers and consumers; we will analyze the characteristics of it. For the sake of simplicity, we assume that k = 0.5 and thus the formula (14) changes into (15):

162

S. Shang et al.

1 2 ⎧ ⎪⎪ ps (Vp ) = 4 + 3 Vp ⎨ ⎪ p (V ) = 1 + 2 V ⎪⎩ b u 12 3 u

　　　　　　　　　　

　　　　　　　　　　　

(15)

(1) Based on the linear equilibrium strategy, the highest bidding price of consumer is p b (1) = 0.75 , and the lowest price of provider is ps ( 0 ) = 0.25 ; (2) If V p > 0.75 , the selling price of provider ps (V p ) = 0.25 + 2 / 3Vp is lower than the real cost. Even if it is higher than the highest bidding price p b (1) = 0.75 , the trade will not happen.

Fig. 2. The Solution of DABGPM Model

(3) When Vu < 0.25 , although the bidding price is higher than the cost, it is lower than the lowest selling price, ps ( 0 ) = 0.25 , the trade will not happen either. This can be seen from Fig. 2. The two lines ps (V p ) = 0.25 + 2 / 3Vp and p b (Vu ) = 1 / 12 + (2 / 3)Vu indicate the optimal pricing strategy of provider and consumer is the bold line in Fig. 2. (4) From the above analysis, we can conclude that ps (V p ) = 0.25 + 2 / 3Vp and p b (Vu ) = 1 / 12 + (2 / 3)Vu is the optimal pricing strategy of provider and consumer respectively. For example, when the cost of provider is 0.3, the optimal price is 0.45, when the valuation of buyer v is 1.0, the optimal price is 0.75.

5 Conclusion With the development of cloud technology, there will be more users and companies renting hardware and/or software resources from the cloud instead of buying and maintaining by themselves. As more and more IT giants invest into the technology, cloud computing has evolved into a big piece of cake. It is a trend that a uniform and complete competitive cloud resources market will emerge in the near future. For such a market, the trading strategy and reasonable price is important. To facilitate such a

DABGPM: A Double Auction Bayesian Game-Based Pricing Model

163

market, we proposed a complete competitive cloud market framework and a double auction Bayesian game-based pricing model and discussed pricing strategies based on this model. We think the work reported here makes a good example of more flexible and more reasonable cloud resources trading and would promote the applications of cloud computing.

Acknowledgement This Work is supported by Natural Science Foundation of China (60803121, 60773145, 60911130371, 90812001, 60963005), National High-Tech R&D (863) Program of China (2009AA01A130, 2006AA01A101, 2006AA01A108, 2006AA01A111, 2006AA01A117).

References 1. http://www.nimbusproject.org/ 2. Buyya, R.: Market-Oriented Cloud Computing: Vision, Hype, and Reality of Delivering Computing as the 5th Utility. In: Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid. IEEE Computer Society, Los Alamitos (2009) 3. Chu, X., Nadiminti, K., Jin, C., Venugopal, S., Buyya, R.: Aneka: Next-Generation Enterprise Grid Platform for e-Science and e-Business Applications. In: Proceedings of the Third IEEE International Conference on e-Science and Grid Computing, pp. 151–159. IEEE Computer Society, Los Alamitos (2007) 4. OpenNebula Project, http://www.opennebula.org/ 5. Zheng, W.M.: An Introduction to Tsinghua Cloud. Science in China Series F: Information Science 53(5) (2010) 6. Amazon Elastic Compute Cloud (EC2), http://aws.amazon.com/ec2/ 7. http://www.microsoft.com/windowsazure/ 8. Google App Engine, http://appengine.google.com 9. http://www.rackspace.com/index.php 10. http://www.gogrid.com/index.v2.php 11. http://www.vps.net/ 12. Buyya, R., Abramson, D., Venugopal, S.: The Grid Economy. Proceedings of the IEEE 93(3), 698–714 (2005) 13. Jin, Y., Shoubao, Y., Maosheng, L., Qianfei, F.: An Autonomous Pricing Strategy toward Market Economy in Computational Grids. In: Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC 2005), vol. II, pp. 793–794. IEEE Computer Society, Los Alamitos (2005) 14. Schwind, M., Gujo, O., Stockheim, T.: Dynamic Resource Prices in a Combinatorial Grid System. In: Proceedings of the The 8th IEEE International Conference on E-Commerce Technology and The 3rd IEEE International Conference on Enterprise Computing, E-Commerce, and E-Services. IEEE Computer Society, Los Alamitos (2006) 15. Zhao, X., Xu, L., Wang, B.: A Dynamic Price Model with Demand Prediction and Task Classification in Grid. In: Proceedings of the Sixth International Conference on Grid and Cooperative Computing, pp. 775–782. IEEE Computer Society, Los Alamitos (2007)

164

S. Shang et al.

16. Lai, K., Rasmusson, L., Adar, E., Zhang, L., Huberman, B.A.: Tycoon: An implementation of a distributed, market-based resource allocation system. Multiagent Grid Syst. 1, 169– 182 (2005) 17. Irwin, D., Chase, J., Grit, L., Yumerefendi, A., Becker, D., Yocum, K.G.: Sharing networked resources with brokered leases. In: Proceedings of the Annual Conference on USENIX 2006 Annual Technical Conference. USENIX Association, Boston (2006) 18. Ren, Y.L., Zou, X.Y.: Bidding game model of a power generation company based on firstprice sealed auction. Journal of Systems Engineering 18(3), 248–254 (2003) 19. Fang, D.B., Wang, X.J., Zhang, Y.X., Liu, K., Wang, P., Zhu, Z.Y.: A Double auction Bayesian model with supplier and demander in open bilateral electricity market. Power System Technology 27(12) (2003) (in Chinese) 20. Gibbons, R.: A Primer in Game Theory. Pearson Higher Education, London (1992) 21. Wilkening, J., Wilke, A., Desai, N., Meyer, F.: Using clouds for metagenomics: A case study. In: IEEE International Conference on Cluster Computing and Workshops CLUSTER (2009)

NPA-BT: A Network Performance Aware BitTorrent Traffic Optimization Mechanism Changyou Xing, Li Yang, and Ming Chen Institute of Command Automation, PLA University of Science and Technology Nanjing, Jiangsu, China, 210007 {xcy,yl,cm}@plaust.edu.cn

Abstract. More than half of today’s Internet traffic is generated by P2P applications, and most of which is the BitTorrent file sharing traffic. The random neighbor selection mechanism of BT puts unprecedented pressure on the network operators and service providers. In this paper firstly we give a theoretical analysis on the impact of neighbor selection mechanism on BitTorrent performance as well as network resource cost. And then based on network performance prediction techniques, we propose a network performance aware BitTorrent traffic optimization mechanism NPA-BT, which uses network performance information to help overlay constructing. Simulation results show that NPA-BT can select peers in the same AS network and with good performance as neighbors, and thus the BitTorrent download time and the cross AS network traffic are all decreased. Keywords: Network performance aware, traffic optimization, BitTorrent Modeling.

1 Introduction Peer-to-Peer (P2P) applications have become immensely popular in the Internet. Network measurements shows that P2P traffic accounts for more than half of the Internet traffic [1]. Among P2P applications, file sharing is perhaps the most popular application. Nodes in P2P system construct their own application layer overlay network. However, most of current P2P systems ignore the performance and structure of under layer physical network, and two neighbor nodes in the overlay network may be far from each other in the physical network. Recent studies showed that the routing mechanisms in P2P overlay networks always violate ISP routing policies, and average P2P bit traverses 1000 miles and 5.5 metro-hops [2]. 50%-90% of existing local pieces in active users are downloaded externally [3]. For one thing, these problems make P2P applications waste network resources and increase the difficulty of ISP network engineering; for the other, these problems also decrease the P2P application performance itself. One of the key reason for such a problem is that the transparency of network performance to upper P2P applications. Internet provides best effort packet delivery

C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 165–176, 2010. © IFIP International Federation for Information Processing 2010

166

C. Xing, L. Yang, and M. Chen

service to applications, with no network performance information to its applications. Due to the overload problems, it is also impossible to get the network performance using active measurement. In the former works, we proposed network delay and available bandwidth prediction models NetPharos [4] and PathGuru [5], which can predict network performance based on partial measurement. Using the information provided by them, we can help P2P applications choosing peers in the same ISP networks and with good performance as neighbors, so as to increase P2P application performance and decrease cross ISP network traffic. Based on this idea, we propose a network performance aware P2P traffic optimization mechanism named NPA-BT, which uses the AS and performance information to construct application layer overlay network. Since BitTorrent traffic is more than half of today’s P2P traffic, we use BitTorrent as our analysis object, but the analysis method is also suitable to other P2P applications. The rest of this paper is organized as follows: Section 2 presents a short survey of related works on P2P application layer traffic optimization. Section 3 models the BitTorrent file downloading process from different point of view, and analyzes the relationship between nodes performance and their AS networks, and provides the network performance aware BitTorrent traffic optimization mechanism NPA-BT. Section 4 evaluates the performance of NPA-BT by simulation. Finally, Section 5 summarizes our work.

2 Related Works Essentially, P2P traffic optimization is solving the mismatch problem between overlay network and physical network. The P4P [6] research project has developed a framework which ISPs can use to convey network information to P2P applications. The framework of the P4P project is based on two main architectural entities: the itracker and the p-distance. With a thorough measurement analysis, [3] showed that regular BitTorrent is totally network-unaware, and they proposed that ISPs intercept P2P-traffic at edge routers and redirect them to P2P-clients within the same ISP. However, they do not investigate such a solution in detail. [7] proposed to improve P2P-locality through network layer topology information, and they gave a biased neighbor selection mechanism, but it needs to get ISP network topology information first, which is hard to get, if not impossible. [8] presented an oracle based ISP and P2P application collaboration mechanism, in which each ISP maintain an oracle server, P2P client can query the oracle to get information about the underlying network, and then it can select node with the best performance as its neighbors. [9] presented an approach to reducing costly cross ISP traffic by using CDN redirection information. Yu presented a hierarchical architecture CBT (Clustered BitTorrent) for grouping BitTorrent-like P2P applications peers in proximity, However they also only focused on network proximity metric, and did not take network download available bandwidth into consideration [10]. For P2P file sharing applications such BitTorrent, the most concern thing is how to decrease file distribution time; but for ISP, the most concern thing is how to decrease cross ISP network traffic and network resource utilization. Unfortunately, most of

NPA-BT: A Network Performance Aware BitTorrent Traffic Optimization Mechanism

167

current traffic optimization mechanisms only focus on one of the two aspects, and cannot achieve the two aims simultaneously.

3 Network Performance Aware P2P Traffic Optimization Mechanism NPA-BT In this section, we will firstly analyze the impact of neighbor nodes selection mechanism on resource download time, discuss the optimization object from different point of view, and then we study on relationships between nodes distances and AS networks. Finally we propose a network performance aware P2P traffic optimization mechanism NPA-BT. 3.1 Theoretical Analysis of BitTorrent Neighbor Selection Mechanism 3.1.1 BitTorrent File Download Time Model Suppose each peer node in BitTorrent system selects n nodes randomly to download data, and the basic parameters are shown as follows:

x(t ) : number of lechers at time t; y (t ) : number of seeds at time t;

λ : arrival rate of new peers; μ : upload rate of peers; c : download rate of peers;

θ : abort rate of download peers; γ : leave rate of seed in the system; η : file share efficiency, most time close to 1. Suppose each peer selects k peers and (n-k) seeds to download data, and μ is the average upload rate of these n nodes. Using the Markov chain based fluid math model, we can give the variation rate of download nodes and seeds as follows:

dx(t ) = λ − θx(t ) − min{cx(t ), kμηx(t ) + (n − k ) μy (t )} dt

(1)

dy (t ) = min{cx(t ), kμηx(t ) + (n − k ) μy (t )} − γy (t ) dt

(2)

Most times the download rate of a node is much larger that its upload rate, thus we can ignore the restriction of upload rate.

min{cx(t ), kμηx(t ) + (n − k ) μy (t )} = kμηx(t ) + (n − k ) μy (t ) (3) To simplify the model, we suppose there are no seeds in the system. That is, each node will quit the system as soon as it completes the file download process. Then equation (1) can be simplified as follows:

dx(t ) = λ − θx(t ) − nμηx(t ) dt

(4)

168

C. Xing, L. Yang, and M. Chen

Here we mainly pay our attention to the performance analysis of system in the _

stabilization state, lim x(t ) = x , and t →∞

dx(t ) = 0 , thus dt _

_

0 = λ − θ x − nμη x

(5)

By solving equation (5) we can get the following result.

λ

_

x=

(6)

θ + nμη

Little law can be used to estimate the average download time of each node during the system stable state, and by using this law we can get equation (7) _

_ λ −θ x _ x = (λ − θ x)T λ

(7) _

In which T represents the average download time, and λ − θ x represents average download rate. From equation (7) we can work out the result of the average download time T. _

T=

x

λ

=

1 θ + nμη

(8)

Generally we suppose η = 1 approximately, and the variation of θ on the system can also be ignored. Then from equation (8) we can see that in BitTorrent file sharing system, the download time is determined by the value of nμ . By choosing nodes with high upload rate as neighbors, the download time can be decreased dramatically. Thus, in order to optimize the BitTorrent file download time, we should make each peer to select peers with the highest upload available bandwidth as neighbors. 3.1.2 Network Resource Optimization Model The upper model mainly focuses on how to decrease the download time of BT peers, and it takes no consideration on network resource usage information during data download period. For example, if the data is downloaded from different ISP networks, the ISP should pay for the cross IPS network traffic. And even in the same ISP, selecting a close peer as neighbor will also save network resources. Thus, we will extend the upper model, and take both ISP network resource and BT download rate optimization into consideration. Suppose Bij (t ) represents the download rate of peer Hi from peer Hj at time t; Di (t ) represents the download rate restriction of peer Hi at time t; U j (t ) represents the upload rate restriction of peer Hi at time t; Lij (t ) represents the distance cost from peer Hi to peer Hj at time t; and Wij (t ) represents the trans-AS cost from peer Hi to peer Hj at time t. We also suppose that a peer will leave the network as soon as it finishes the download process, and there are n nodes in the system. From the viewpoint of P2P performance optimization, we want to maximize the download rate of the n nodes at any time t, which can be represented as follows.

NPA-BT: A Network Performance Aware BitTorrent Traffic Optimization Mechanism n

max ∑

n

∑B

i =1 j =1, j ≠ i

n

∑B

s.t.

j =1, j ≠ i n

∑B

i =1,i ≠ j

ij

ij

169

ij

(t )

(t ) ≤ Di (t )

(t ) ≤ U j (t )

(9)

∀H i ∀H j

Bij (t ), Di (t ),U i (t ) ≥ 0

∀H i , H j

On the other hand, from the viewpoint of ISP network resource optimization, we want to minimize the total network resource cost at any time t. n

min ∑

n

∑ (L

ij

i =1 j =1, j ≠ i

s.t. Lij (t ) > 0,

(t ) + Wij (t )) Bij (t ) ∀H i , H j , i ≠ j

⎧⎪m(t ) Wij (t ) = ⎨ ⎪⎩0

(10)

，

H i and H j in different AS m(t ) > 0 H i and H j in the same AS

The upper expression describes the optimization objects at each time t. Here we take the full data download period into consideration, and suppose Ti and Ci represent the time and cost of peer Hi needed to download a file separately. Generally, the BitTorrent peer downloads resources from different neighbors simultaneously. Suppose Hi downloads M ij ( M ij ≥ 0) data from Hj, and then the finish time of Hi is determined by the maximum download time from the n-1 peers, that is

⎛M ⎞ Ti = max⎜ ij ⎟ j ⎜ B ⎟ ⎝ ij ⎠

(11)

Without loss of generality, we suppose the size of downloaded file is 1, and there is no replicated download problem, then the total number of data downloaded by each node Hi must be 1, that is n

T

j =1

0

∑∫

Bij (t )dt = 1

(12)

To simplify the model, we ignore the limit of Di. That is, the download capability of a peer is large enough, so that the download rate of a node is the sum of all upload rates it gets from other nodes. Under this constraint, the best download schedule mechanism is that Hi partitions data into n-1 blocks and download them from n-1 neighbors separately, and the size of each block should be consistent with the upload rate of the corresponding peer, so that Hi can finish the n-1 download simultaneously. The worst download schedule mechanism is that Hi downloads the whole data from a node with the smallest upload bandwidth. Thus the download finish time of node Hi is

1 1 ≤ Ti ≤ min Bij ∑ Bij j

j

(13)

170

C. Xing, L. Yang, and M. Chen

On the other hand, the network resources Hi costs to complete its data download is determined by its download bandwidth, data transfer distances and the cross AS network cost. n

Ti

j =1

0

Ci = ∑ ∫ ( Lij (t ) + Wij (t )) Bij (t ) dt

(14)

Using the result of expression (12), we can get

min ( Lij ) ≤ Ci ≤ max ( Lij + Wij ) j

j

(15)

Thus, from the viewpoint of resource download, P2P application wants to minimize the total download time, and ISP wants to minimize network resource cost. The two optimization object can be represented as follows. n n ⎧ ⎛ M ij ⎞ ⎟ ⎪min ∑ Ti = min ∑ max ⎜ j ⎜ B ⎟ ⎪ i =1 i =1 ⎝ ij ⎠ ⎨ n n n Ti ⎪ ⎪min ∑ Ci = min ∑ ( ∑ ∫0 ( Lij (t ) + Wij (t )) Bij (t )dt ) i =1 j =1, j ≠ i i =1 ⎩

(16)

Taking equation (13), (15) and (16) together, we can see that the optimization of P2P application download time is determined by choosing high available bandwidth nodes, which is consistent with our upper model. But for the ISPs to optimize network resource utilization, the distances between neighbors and the cross AS traffic should be decreased. Thus to achieve this aim, the BitTorrent clients should select nodes in the same ISP network or close networks as neighbors. Besides, links between different ASes always form bottlenecks, thus selecting nodes in the same AS network as neighbors will also provide high download rate, and the two optimization objects can be satisfied simultaneously. Based on this discussion, we will propose the network performance aware BitTorrent traffic optimization mechanism NPA-BT in the next section. 3.2 Network Performance Aware BitTorrent Traffic Optimization Mechanism NPA-BT As discussed above, to optimize BitTorrent traffic, the neighbor selection mechanism should be modified. Each peer should select nodes in the same AS network or close to it, and have high bandwidth as its neighbors. An intuitive idea is firstly using NetPharos and PathGuru to predict the distances and available bandwidths between candidate peers, and then selecting nodes with the lower delay and higher bandwidth as neighbors. But an open question is whether nodes with short distances are in the same AS network? To answer this question, we firstly give an analysis on a measurement dataset. Ledlie uses a typical BitTorrent client Azureus to measure distances from 283 PlanetLab nodes to 156658 Azureus peers running all around the world. The measurement dataset includes about 9.5 x 107 items, and each item includes node IP address and distances [11]. We use this dataset to study the nodes distances and AS relationships. Firstly we use RouteView [12] information to find the AS number and

NPA-BT: A Network Performance Aware BitTorrent Traffic Optimization Mechanism

171

country information of each IP address, and then estimate the relationship between distance and AS. Fig. 1 shows the AS distribution of nodes with different distances, from which we can see that there is no direct relationship between nodes distance and their AS networks, and nodes with small distances may locate in different ASes. For example, nodes that have distances less than 2 ms are located in more than 5 ASes, which means that when using distances as the criteria to select neighbors, we cannot achieve the aim of decreasing inter AS traffic. 25

Number of Ases

20 15 10 5 0 1

2

3

4

5 6 Delay (ms)

7

8

9

10

Fig. 1. Number of AS distribution with different delays

Besides, we also investigate the distance variation when selecting nodes in the same AS or the same country as neighbors. In mathematics, variance is used to describe the scattering degree of a variable, but this metric is related to the quantity of the variable, and thus we use the ratio of standard deviation and expectation to eliminate such impact. If the expectation of a variable X is E ( X ) , and its variance is D ( X ) , we define its standard deviation and expectation ratio as

DE ( X ) =

D( X )

. DE ( X ) = 0 means the value of X has no fluctuation at E( X ) all. Otherwise the larger DE ( X ) is, the more serious X fluctuates. Fig. 2 shows the standard deviation and expectation ratio of nodes distances in the same AS or the same country, from which we can see that no matter which kind is considered, the distances fluctuate seriously. This result means if we only use in the same country or AS network as neighbor selection criteria, we can reduce inter AS traffic, but we cannot guarantee to select nodes with short distances. From the upper discussion we can see that only using distances or AS relationships as neighbor selection criteria has limitations, and thus we combine the network performance information and AS information during neighbor selection. Firstly we find nodes in the same AS network as candidates, and then select nodes that have small distances and high available bandwidth from the candidates as neighbors. Besides, though the random neighbor selection mechanism of BitTorrent has the low efficiency problem, it can enhance the system robustness. If nodes all select peers in its AS network as neighbors, the BitTorrent overlay may be isolated. Wang showed that resources in a BitTorrent overlay network have typical region feature. For example, videos with a special language

172

C. Xing, L. Yang, and M. Chen

are only popular in regions that use that language as mother tongue, and the resources are rare in other regions [13]. Thus if the number of peers in an AS network is very small, we still use random candidate neighbor selection mechanism, so as to keep the connectivity of overlay network. The algorithm is shown as follows. 1 0.9

Cumulative Distribution

0.8 0.7 0.6 0.5 0.4 0.3 0.2 Delay variance ratio to nodes in a country Delay variance ratio to nodes in a AS

0.1 0

0

0.5

1

1.5 2 2.5 Delay variance ratio

3

3.5

4

Fig. 2. Standard deviation and expectation ratio of nodes distances in the same AS or the same country

Algorithm: NPA-BT //Select peers in the same AS to form Candidate Begin Get peer_list from tracker; For each peeri in peer_list Begin If ASi = = ASlocal Add peeri into Candidate; End // num < threshold means the number of peers in this // AS is too small. Only select peers with good // performance as neighbors num := sizeof (Candidate); If (num < threshold) Begin Select threshold peers randomly from peer_list; Add the threshold peers to Candidate; End // Select good performance peers in candidate For each peeri in Candidate Begin distancei := NetPharos (local, peeri); avbi := PathGuru (local, peeri); perfi := avbi/distancei; Sort Candidate by perfi; Return the first k nodes in Candidate; End End

NPA-BT: A Network Performance Aware BitTorrent Traffic Optimization Mechanism

173

The algorithm of NPA-BT includes two main parts: finding candidates and selecting neighbors. During the finding candidates period, we try to find peers in the same AS network as candidate neighbors, but if the number of peers in an AS network is too small (less than predefined threshold), we will use random peer selection mechanism to add some peers to the candidates. During the neighbor selection period, we firstly use NetPharos and PathGuru to compute the distance and available bandwidth metrics, and use their ratio to represent the performance of a peer, and finally select the first k nodes with good performance.

4 Performance Evaluation In this section, we evaluate the performance of NPA-BT by simulation, and the simulation parameter is similar to [14]. Firstly a two hierarchical network topology with 300 routers is generated using BRITE, which includes 32 backbone routers, and they only provide traffic forwarding service. Others are stub routers, which provide network access service to end users. 1000 end user nodes are connected to the stub routes randomly, and the access bandwidth is distributed uniformly between 1~9 Mbps. The distance of each link in the stub network is distributed uniformly between [1, 3], and the distance of each link between stub router and transit router is distributed uniformly between [10, 15], and the distance of each link in the transit network is distributed uniformly between [20, 90]. For the upper network topology, we use NetPharos and PathGuru to predict the distances and available bandwidths between every two end nodes. And during the simulation each end user generates 100 download requests randomly, and each resource has 30 replicas distributed randomly in the 1000 end nodes. We use both random neighbor selection and NPA-BT to construct, and compare the average performance between neighbor nodes in the overlay networks. 1 0.9 0.8 0.7

CDF

0.6 0.5 0.4 0.3 0.2 NPA-BT Original BT

0.1 0

0

20

40

60 80 100 120 140 RTT Between Neighbors (ms)

160

180

Fig. 3. Cumulative distribution of RTT between neighbors

Fig. 3 and Fig. 4 show the cumulative distribution of average distances and available bandwidth when using the two peer selection mechanism. From which we can see that with the help of network performance information, NPA-BT can select

174

C. Xing, L. Yang, and M. Chen

the more efficient nodes as neighbors, and decreases the distances, increases available bandwidths between neighbors. Thus we can achieve the optimization object of decrease download time and cross AS network traffic by using NPA-BT. 1 0.9 0.8 0.7

CDF

0.6 0.5 0.4 0.3 0.2 NPA-BT Original BT

0.1 0

1

2 3 4 5 6 7 8 9 Available Bandwidth Between Neighbors (Mbps)

10

Fig. 4. Cumulative distribution of available bandwidth between neighbors

To improve the object download efficiency, many P2P applications will select more than 1 node as peers to download data at the same time. Fig. 5 and Fig. 6 show the distance and available bandwidth comparison of selecting peers with and without network performance information. From these two results we can see that no matter which metric is considered, neighbors selected by NPA-BT are better than that selected randomly in original BT. 120

NPA-BT

Original BT

100

RTT (ms)

80 60 40 20 0 1

2

3 4 Number of peers

5

6

Fig. 5. Relationship between average RTT and number of neighbors

Fig. 5 and Fig. 6 also show that as the number of neighbors increases, the average distance of peers selected by NPA-BT also increases accordingly (the average available bandwidth decreases accordingly), but the average distance or available bandwidth of peers selected randomly nearly keeps to be the same. This phenomenon is because NPA-BT always selects the best peers as neighbors. As the number of

NPA-BT: A Network Performance Aware BitTorrent Traffic Optimization Mechanism

175

neighbors increases, some sub-optimal nodes must be selected as neighbors, and thus the average performance will decrease accordingly. But it still much larger than that of random peer selection mechanism. 9 Average available bandwidth(Mbps)

NPA-BT 8 Original BT 7 6 5 4 3 2 1 0 1

2

3 4 Number of peers

5

6

Fig. 6. Relationship between average available bandwidth and number of neighbors

Taking all the above analysis into consideration, we can see that NPA-BT can select the best peers as neighbors, which increases the performance of P2P applications as well as decreases the cross ISP network traffic. As the emergence of new applications, how to provide network status information to them so as to optimize their performance becomes more and more important. And this method can also be extended to solve other distributed application optimization problems.

5 Conclusions More than half of current Internet traffic is generated by P2P applications, and most of which is BitTorrent traffic. Thus how to optimize P2P traffic is considered both by ISP and Internet researchers. In this paper we give a analysis on the impact of neighbor selection on BitTorrent performance and network resource consumption, and then propose a network aware BitTorrent traffic optimization mechanism NPABT. Simulation results show that with network status information, NPA-BT can select the best peers as neighbors, and thus improve the BT performance as well as decrease the inter AS traffic. In the future work, we are going to deploy the NPA-BT on PlanetLab so as to test its performance. Since the significance of P2P traffic optimization on both ISP and P2P applications, this research will be useful in providing a mechanism for BitTorrent traffic optimization.

References [1] IPOQUE. Internet Study 2007: Data about P2P, VoIP, Skype, file hosters like RapidShare and streaming services like YouTube (November 2007), http://www.ipoque.com/media/internet_studies/ internet_study_2007 [2] Seetharaman, S., Ammar, M.: Characterizing and mitigating inter-domain policy violations in overlay routes. In: Proc. of IEEE ICNP (2006)

176

C. Xing, L. Yang, and M. Chen

[3] Karagiannis, T., Rodriguez, P., Papagiannaki, K.: Should internet service providers fear peer-assisted content distribution? In: Proc. of the ACM SIGCOMM Internet Measurement Conference 2005, Berkeley, CA (October 2005) [4] Xing, C., Chen, M.: A Virtual Node based Network Distance Prediction Mechanism. In: Proceedings of the IEEE GlobeCom 2008 (2008) [5] Xing, C., Chen, M., Yang, L.: Predicting Available Bandwidth of Internet Path with Ultra Metric Space-based Approaches. In: Proceedings of the IEEE GlobeCom 2009 (2009) [6] Xie, H., Yang, Y., Krishnamurthy, A., Liu, Y., Silberschatz, A.: P4P: Provider Portal for Applications. In: Proceedings of the ACM SIGCOMM 2008, pp. 351–362. ACM Press, New York (2008) [7] Bindal, R., Cao, P., Chan, W., et al.: Improving traffic locality in BitTorrent via biased neighbor selection. In: Proc. of IEEE ICDCS 2006, Lisboa, Portugal (July 2006) [8] Aggarwal, V., Feldmann, A., Scheideler, C.: Can ISPs and P2P Systems Co-operate for Improved Performance? ACM SIGCOMM Computer Communications Review 37(3), 29–40 (2007) [9] Choffnes, D., Bustamante, F.: Taming the Torrent: A practical approach to reducing cross-isp traffic in peer-to-peer systems. In: Proceedings of the ACM SIGCOMM 2008, pp. 363–374. ACM Press, New York (2008) [10] Yu, J., Li, M.: CBT: A proximity-aware peer clustering system in large-scale BitTorrentlike peer-to-peer networks. Computer Communications 31, 591–602 (2008) [11] Network Coordinate Research at Harvard, http://www.eecs.harvard.edu/~syrah/nc/ [12] Route Views Project, http://www.routeviews.org/ [13] Wang, H., Liu, J., Xu, K.: On the Locality of BitTorrent-based Video File Swarming. In: Proc. of IPTPS 2009, Boston, MA (2009) [14] Shen, G., Wang, Y., Xiong, Y., et al.: HPTP: Relieving the Tension between ISPs and P2P. In: Proc. of ACM IPTPS 2007, Bellevue, WA (Feburary 2007)

User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors Jiqiang Song1, Eugene Y. Tang1, and Leibo Liu2 1 Intel Labs China, Beijing, China {jiqiang.song,eugene.y.tang}@intel.com 2 Tsinghua University, Beijing, China [email protected]

Abstract. More and more mobile phones are equipped with multiple sensors today. This creates a new opportunity to analyze users’ daily behaviors and evolve mobile phones into truly intelligent personal devices, which provide accurate context-adaptive and individualized services. This paper proposed a MAST (Movement, Action, and Situation over Time) model to explore along this direction and identified key technologies required. The sensing results gathered from some mobile phone sensors were presented to demonstrate the feasibility. To enable always sensing while reducing power consumption for mobile phones, an independent sensor subsystem and a phone-cloud collaboration model were proposed. This paper also listed typical usage models powered by mobile phone sensor based user behavior prediction. Keywords: Mobile computing, sensor, user behavior analysis, pattern prediction, MAST, context-adaptive, individualized.

1 Introduction Sensors are essential components of mobile phones from the beginning. The major functionality of early mobile phones is voice call, for which the microphone sensor is the input device. Later on, the camera sensor is integrated into mobile phones as a must-have feature. Today’s mobile phones are far different from a decade ago. More and more sensors are integrated into mobile phones to provide revolutionary user experience. For example, Apple’s iPhone [1] uses multi-touch screen, gravity sensor, light sensor to enable a unique user interface, whereas Google’s Nexus One [2] is equipped with dual microphones, one is on the front, the other is on the back, to do noise cancellation. With the fast development of highly-integrated MEMS sensors [3], mobile phones with multiple useful sensors will become affordable for mass consumers. This creates a new opportunity to analyze users’ daily behaviors and evolve mobile phones into truly intelligent personal devices, which provide accurate context-adaptive and individualized services. This paper presents a preliminary work on this focus. Section 3 proposes a MAST (Movement, Action, and Situation over Time) model to C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 177–189, 2010. © IFIP International Federation for Information Processing 2010

178

J. Song, E.Y. Tang, and L. Liu

characterize human behaviors, and provides a probabilistic method to analyze and predict user behavior patterns. Key technologies required for this research are discussed in Section 4. Finally, we list three typical usage models based on user behavior pattern analysis and prediction using mobile phone sensors.

2 Related Work User behavior analysis is not a new topic. Plenty of research work [4-6] has been done by analyzing applications and contents that users interact with. However, these analysis tools are either embedded in applications or in the service provider’s servers, which are not taking into account the new and powerful sensing capability of next generation mobile phones. Some work on wearable sensors [7] utilized multiple sensors, but assuming sensors are worn on several positions of human body, which is very different from the usage model of mobile phones. On the other hand, existing research on sensors mainly focuses on the environmental infrastructure buildup and inter-device communications, such as ubiquitous computing [8], pervasive computing [9], and wireless sensor network [10]. Different from the above research, the work presented in this paper is mobile phone centric, and will be complementary with the advancement of above research and form synergy for mobile computing.

3 MAST Model 3.1 Concept This section proposes a simple model for mobile phone user behavior analysis. The basis of user behavior analysis is to understand human activities and link them to other factors. In general, human daily activities can be classified into movements and actions according to their effects. First, we give their definitions in our context. z z

A movement is a salient physical activity which causes location change or pose change of human body, e.g., sit down, walk, run, etc. An action is a subtle physical activity which generates impact to the rest of the world other than location change or pose change of human body, e.g., talk, press on a key, tap on a screen, etc.

There are two important factors related to human activities, time and situation. Time is the natural time perceivable by human, measured by seconds. Situation is the location where human activities happen and the environment that affects human activities. Note that the meaning of location in the situation defined here is not the same as the precise geographic location which is measured by longitude and latitude. It is what human regard as a “place”, such as office, home, bus, etc.

User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors

179

Movements and actions have different relations with time and situation. Movements usually take measurable time, and it may bring human from one situation to another, while actions take variable time (very short for pressing a key or pretty long for watching a video), and it does not lead to situation changes. Figure 1 shows an example of recording a person’s behavior over a short period of time, where Si (i=1..3) denote different situations, Mi (i=1..4) denote different movements, and Ai (i=1..5) denote different actions.

Fig. 1. Temporal record of a person’s behavior over time

From Figure 1, one can clearly determine where the person is and what the person is doing at any particular time. We call this the MAST (Movement, Action, Situation over Time) model. M(t), A(t) and S(t) denote the value of movement, action and situation at time t, respectively. 3.2 Probabilistic Model If we use one day as the analyzing period for human behavior, M(t), A(t) and S(t) may be different at the same t of different dates. Therefore, M(t), A(t) and S(t) are not scalar values, but vectors containing probabilities of possible values. For example, suppose the possible movements of a person at a particular time are Mi (i=1..n), M(t) = [P1, P2, .., Pn], where Pi is the probability of Mi happening at that time, and Pi =1.0. Same for A(t) and S(t). M(t), A(t) and S(t) are denoted as column vectors. M(t), A(t) and S(t) can be independently observed over time and over days. Since a person’s daily life does show certain patterns due to his/her occupation and habits, M(t), A(t) and S(t) will converge after a certain period. They may converge at more than one peaks, e.g., corresponding to weekday pattern, weekend pattern, etc. This will be useful to predict a person’s movement, action or situation by time. However, the prediction is sensitive to the possible time shift of human activity. On the contrary, the correlations among movements, actions and situations are less time sensitive, so that they are more robust to characterize human behavior patterns. Figure 2 shows the probabilistic relations among M(t), A(t) and S(t). In short, when a user is observed in

∑

180

J. Song, E.Y. Tang, and L. Liu

Δ

Δ

state M(t), A(t) or S(t) at time t, how possible he/she will be in state M(t+ t), A(t+ t) or S(t+ t) after time t, where t is the time interval between observation and prediction.

Δ

Δ

Δ

Fig. 2. Probabilistic relation model among M(t), A(t) and S(t)

Below are the physical meanings of probabilities defined in Figure 2, where i, j are indexes of possible actions, movements or situations. z z z z z z z z z

Δ Δ Δ Δ Δ Δ Δ Δ Δ

PΔt(A|A) indicates when Ai happens at t, the probability of Aj happening at (t + t). PΔt(M|A) indicates when Ai happens at t, the probability of Mj happening at (t + t). PΔt(S|A) indicates when Ai happens at t, the probability of being Sj at (t + t). PΔt(M|M) indicates when Mi happens at t, the probability of Mj happening at (t + t). PΔt(A|M) indicates when Mi happens at t, the probability of Aj happening at (t + t). PΔt(S|M) indicates when Mi happens at t, the probability of being Sj at (t + t). PΔt(S|S) indicates when being Si at t, the probability of being Sj at (t + t). PΔt(A|S) indicates when being Si at t, the probability of Aj happening at (t + t). PΔt(M|S) indicates when being Si at t, the probability of Mj happening at (t + t).

Therefore, for a specified time t, PΔt(Y|X) is a two-dimension probability matrix, which is trained by day-to-day observations of user activities. Its row dimension equals to the dimension of X(t), and its column dimension equals that of Y(t). For any given Xi detected at time t, PΔt(Y|X) stores the probability of each Yj happening after t. The minor time shifts of user activities should be considered during training PΔt(Y|X). When X(t) or Y(t) is updated with new observation data, PΔt(Y|X) should be updated as well.

Δ

User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors

181

With the converged MAST probability model trained from user activities over many days, one can extract many interesting user behavior patterns by time-movement correlation, time-action correlation, time-situation correlation, movement-movement correlation, movement-action correlation, etc. It is also interesting to extract time-related correlations and time-unrelated correlations. 3.3 MAST-Based Behavior Prediction The MAST probabilistic model described in the above section is designed for characterizing a user’s daily behavior pattern. As long as the model has been trained to converge, it is very useful to predict the user’s next activity based on his/her current activity being detected. For example, assuming time is always a known value, below is a basic flow for predicting action by movement. i. ii. iii. iv. v. vi.

Detect a human movement, denoted by Mi; Using the current time t as a starting point, search the neighborhood of M(t) for a local probability peak for Mi; If the local peak is larger than a threshold, a strong prediction can be made; otherwise, a weak prediction is made; Get the time t’ corresponding to the local peak; Using t’ as the starting point, search the neighborhood of PΔt(A|M) for a local probability peak for Mi; Take the Aj corresponding to the local peak as the prediction.

The prediction flows for other input/output requirements are similar. One can also further develop to use more than one input to predict more precisely.

4 Key Technology There are many methods, such as Bayesian-based methods [11] and HMM-based methods [12], available for training the MAST probabilistic model, so that we do not repeat them in this paper. This section describes the other key technologies required for user behavior pattern analysis and prediction based on mobile phone sensors. They include user activity detection, independent sensor subsystem, and phone-cloud collaboration. 4.1 User Activity Detection When using sensors embedded in a mobile phone to collect its user’s activity data in a non-intrusive way, the detectable movements and situations are not as many as those detectable by surrounding sensors or wearable sensors. On the other hand, not all human movements and actions are concerned for user behavior analysis. Table 1 shows some concerned movements, actions and situations.

182

J. Song, E.Y. Tang, and L. Liu

Table 1. Concerned movements, actions and situations

Movements Sit down Stand up Walk on flat Walk upstairs Walk downstairs Take lift up Take lift down Run Drive

Actions Phone call Read news Web search Read mail Write mail Take photo Listen music Watch video Visit SNS website

Situations Home Office Car Bus Subway Street Canteen Meeting room Playfield

Among several sensors on a mobile phone (e.g., Nexus One), we identified that the 3-axis accelerometer and the 3-axis orientation sensor (a.k.a. digital compass) played important roles in detecting movements. Figure 3 shows the definition of 3D output of these two sensors, where G(i) means gravity acceleration on axis i.

Fig. 3. 3D output definitions of accelerometer and orientation sensor on a phone

Figure 4 shows the accelerometer data graphs over time for some typical movements, which demonstrate salient features. In the figure, the horizontal axis is time, and the vertical axis is G(i). Since the accelerometer data are 3-dimension vectors, 2-dimension and 3-dimension graphics pattern recognition method [13] is employed to detect movements. Orientation sensor data are used to determine the orientation and pose of phone, which does not necessary indicate its user’s orientation because there are different positions that the phone may be placed on human body, e.g., shirt pocket, coat pocket, trousers pocket or bag. When the user is not moving, it is impossible to determine the user’s orientation from orientation sensor output. Fortunately, we observed that when the user is doing the same walking movement on flat ground, the accelerometer sensor

User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors

183

a. G(x) data for walking on flat

b. G(x) data for walking upstairs

c. G(z) data for taking lift up Fig. 4. Accelerometer data graphs for some typical movements

output has different features for different phone poses, so that one can derive the relative position of phone on its user’s body. The user orientation detection algorithm may fuse the accelerometer output and orientation sensor output to determine user’s current

184

J. Song, E.Y. Tang, and L. Liu

orientation. Table 2 shows some results of user walking orientation detection using sensor output fusion. In the table, the accelerometer output, G*(i), is the averaged pace-wise sum of gravity acceleration values on each axis for walking, and the granularity of fused user orientation is 15⁰ in the 0⁰~360⁰ scope. From the result, we can see that when the phone moves with user at different poses, e.g., holding phone in hand to read messages so that the phone is heading front and facing up, the feature (G*(x), G*(y), G*(z)) varies much, and the fused user orientation can be significantly different from the original “Azimuth” output. Table 2. Sensor output fusion for user orientation detection Phone pose

Accelerometer output

Orientation sensor output

Fused user

Heading

Facing

G (x)

G (y)

G (z)

Azimuth

Pitch

Roll

orientation

Front

Up

-0.8

101.1

301.4

287⁰

-18⁰

0⁰

285⁰

Up

Back

49.7

308.9

-39.3

231⁰

-99⁰

9⁰

240⁰

Down

Front

56.2

-299.9

58.6

43⁰

75⁰

10⁰

75⁰

Back

Down

-18.0

-60.1

-297.9

102⁰

152⁰

-4⁰

255⁰

*

*

*

All concerned actions listed in Table 1 are through a user’s interaction with the applications installed in a mobile phone. These actions cannot be detected by sensors, but can be detected by software approach. We used an application manager which monitors the execution status of applications, such as start, active, inactive, and stop. Therefore, the user actions can be recorded by the virtual desktop. For the situation detection, the microphone sensor is most useful. There are some pretty good methods [14] can be used to detect situations by audio only. The fusion with other sensors, such as accelerometer, will further clear the ambiguity between similar audio scenes, like canteen and playfield, since the body movements are quite different. 4.2 Independent Sensor Subsystem Unlike other wearable sensor based user behavior analysis approach, we must consider the limitation imposed by mobile phone sensors. The most critical limitation is that mobile phone is a battery-powered small device. Detecting a user’s movement requires always sensing in the background, which means sensors are working even the user does not use the phone or phone is in sleep mode. This will pose a severe power consumption issue, since sensor data need CPU processing in today’s mobile phone SoC (System on Chip) architecture, so that CPU cannot sleep when sensors are working. Figure 5 shows a typical mobile phone SoC architecture.

User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors

185

Fig. 5. Current mobile phone SoC architecture

In details, following reasons make the architecture in Figure 5 not suitable for always sensing. i. ii.

iii.

Sensors’ always-on traffic characteristic is different from other low-speed peripherals; There are two levels of buses to separate high-speed and low-speed traffics to alleviate burden on main bus. However, sensor data need to travel upwards to be buffered and processed by CPU. It does not make sense either to keep the high-speed bus on just waiting for the low-speed sensor data; Furthermore, there is no direct communication between sensors, which prevents sensors from autonomous inter-control and collaboration.

To save the power consumption for always sensing, we propose to design an independent sensor subsystem in mobile phone SoC to resolve the above problems, as shown in Figure 6. This design isolates the core network from the sensor network. It not only allows CPU and main bus to sleep while one or more sensor subsystems are working, but also adds intelligence to sensor peripheral for inter-sensor communication, sensor data manipulation, and sensor data fusion. Isolation of the core network from the sensor network mainly occurs in two different aspects – power and clocking. There are well-known techniques in IC (Integrated Circuit) design for creating different power islands and clock sub-networks, so they will not be repeated here. Adding intelligence to the sensor network is more complicated. On the one hand, we do not want to add a processing block to each sensor, as it creates significant overhead. On the other hand, it is not a good idea to have one processing block for all sensors, as this will cause a significant burden on the routing and processing capability. Instead, what we are proposing is a network of sensor subsystems, where each subsystem has its own processing unit. The size of each subsystem depends on the sensor processing requirements, size of chip and capability of the processing unit. We can employ power saving techniques to such a network of sensor subsystems by further dividing the sensor subsystems into sub-groups.

186

J. Song, E.Y. Tang, and L. Liu

Fig. 6. Independent sensor subsystem

4.3 Phone-Cloud Collaboration It’s well known that probabilistic model training is computation intensive. Mobile phone itself is not an ideal platform to do that, unless there is specific hardware accelerator for that, which is not true nowadays. Therefore, how to leverage the network side capability is essential. With the fast development of wireless communication, always-connected mobile phones will become mainstream and the average available bandwidth for each user will exceed 1Mbps at the 4G era. In this circumstance, cloud computing and storage service [15, 16] is going to be the most easy-to-access and always-available computation and storage resource for mobile phones. We propose a phone-cloud collaboration model to fully exploit the advantages of both sides, as shown in Figure 7. In this collaboration model, the phone acts like a sensory organ, while the cloud acts like a brain. The cloud must emphasize privacy management, so that its users feel like they are using a private computer. The MAST probabilistic model for a user is stored in the cloud side, so that switching phones does not impact behavior analysis. The cloud provides standard interface to interact with phones. The phone uses its sensors and/or software to detect its user’s movements, actions and situations. The detection results are sent to the cloud for two purposes. One is to train the MAST probabilistic model for a specific user; the other is to serve as input to predict the user’s next behavior. With this partition, the communication data rate is very low since only results are transmitted.

User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors

187

Fig. 7. Phone-cloud collaboration model

5 Usage Model Mobile phone sensor based user behavior pattern analysis and prediction is the foundation of evolving mobile phones from a life-enriching device to an intelligent user behavior assistant. This section depicts three usage models powered by this technology. 5.1 Proactive UI Adaptation This usage model is between user and phone. Today’s mobile phone contains many applications that cannot be easily navigated on the small screen. Even though there are many UI innovations on mobile phones, but they still cannot fulfill individual’s need. From the user behavior analysis, the user pattern about using what application at what time, what situation, and after what movement, what action is known. With the real-time user behavior prediction result, the phone’s UI can be reorganized to adapt to the next action or situation, so that the user will always see the application he/she wants to use on the first screen. 5.2 Proactive Situation Adaptation This usage model is between user and situation. There are two aspects: one is that the phone adapts to the situation, the other is that the site adapts to the user. For the former, when a user is in different situations (e.g., home, office, driving, playfield, etc.), he/she wants to use the phone in different modes, such as turning the ring volume louder or lower, declining all calls, etc. For the latter, the same site may need to change settings or services for different users. For example, a car shared by several users need to change seat position and radio channels when a user is coming. With the real-time user behavior prediction result, the proactive adaptation can be realized.

188

J. Song, E.Y. Tang, and L. Liu

5.3 Proactive Service Adaptation The above two usage models are beneficial to phone users only. This usage model is between user and service provider. If a user is willing to share his/her behavior pattern information to service providers, service providers can access the user behavior pattern analysis and prediction interface in the cloud to classify the user’s profile and predict the user’s activity, so that to push accurate commercial services to him/her. This is a win-win usage model. Users get timely individualized service, and service providers save costs on untargeted promotions.

6 Summary This paper proposed a MAST probabilistic model for using mobile phone sensors to analyze and predict user behavior patterns. Some concerned new technologies, like user activity detection, independent sensor system and phone-cloud collaboration, are discussed. From those preliminary studies, we find that this research direction is promising while challenging. In the future, we will further investigate those key technologies and complete a prototype system.

References [1] [2] [3] [4]

[5]

[6]

[7]

[8]

[9]

[10] [11]

Apple’s iPhone Information (April 6, 2010), http://www.apple.com/iphone/ Google’s Nexus One Information (April 6, 2010), http://www.google.com/phone Bryzek, J., et al.: Marvelous MEMS. IEEE Circuits and Devices Magazine 22, 8–28 (2006) Dou, Y., Yan, H., Lei, Z.: Broadband dial-up user behavior identification and analysis. In: Proceedings of the 2nd IEEE International Conference on Broadband Network & Multimedia Technology, pp. 316–322 (2009) Vilas, M., et al.: User behavior analysis of a video-on-demand service with a wide variety of subjects and lengths. In: Proceedings of the 31st EUROMICRO Conference on Software Engineering and Advanced Applications, pp. 330–337 (2005) Chung, Y.W., Chung, M.Y., Sung, D.K.: Modeling and Analysis of Mobile Terminal Power on/off-State Management Considering User Behavior. IEEE Transactions on Vehicular Technology 57, 3708–3722 (2008) Corbellini, S., Ferraris, F., Parvis, M.: A System for Monitoring Workers Safety in an Unhealthy Environment by means of Wearable Sensors. In: Proceedings of IEEE Instrumentation and Measurement Technology Conference, pp. 951–955 (2008) Cheng, J.: Testing and Debugging Persistent Computing Systems: A New Challenge in Ubiquitous. In: Proceedings of IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, pp. 408–414 (2008) Reddy, Y.V.: Pervasive Computing: Implications, Opportunities and Challenges for the Society. In: Proceedings of the 1st International Symposium on Pervasive Computing and Applications, pp. 5–5 (2006) Chong, C.-Y., Kumar, S.P.: Sensor Networks: Evolution, Opportunities, and Challenges. Proceedings of the IEEE 91, 1247–1256 (2003) Mirikitani, D.T., Nikolaev, N.: Recursive Bayesian Recurrent Neural Networks for Time-Series Modeling. IEEE Transactions on Neural Networks 21, 262–274 (2010)

User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors

189

[12] Wu, F., Chiu, I.-H., Lin, J.-R.: Prediction of the intention of purchase of the user surfing on the Web using hidden Markov model. In: Proceedings of ICSSSM 2005, International Conference on Services Systems and Services Management, pp. 387–390 (2005) [13] Lu, T., et al.: A novel knowledge-based system for interpreting complex engineering drawings: theory, representation and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 1444–1457 (2009) [14] Lu, L., Zhang, H., Jiang, H.: Content Analysis for Audio Classification and Segmentation. IEEE Transactions on Speech and Audio Processing 10, 504–515 (2002) [15] Amazon Cloud Computing Service Information (April 6, 2010), http://aws.amazon.com/ec2/ [16] Amazon Cloud Storage Service Information (April 6, 2010), http://aws.amazon.com/s3/

ServiceStore: A Peer-to-Peer Framework for QoS-Aware Service Composition* Jun Jin1, Yu Zhang2, Yuanda Cao1, Xing Pu1, and Jiaxin Li1 1

Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing, China 2 School of Computer Science, Beijing University of Civil Engineering and Architecture, Beijing, China [email protected]

Abstract. Web service composition is to integrate component services for providing a value-added new service. With the growing number of component services and their dynamic nature, the centralized composition model can’t manage them efficiently and accurately. In this paper, we proposed a distributed hash table (DHT)-based peer-to-peer (P2P) service composition framework, called ServiceStore. Compared with the central control in centralized model, in our ServiceStore, service selection and composition are distributed to the involved task brokers, requesting nodes and service nodes. Furthermore, a simple parallel service selection approach which can still satisfy global constraints is proposed and implemented in our multi-role cooperation (MRC) protocol. The results of experimental evaluation show that ServiceStore can achieve high scalability and efficiency. Keywords: P2P; DHT; service selection; task broker; MRC protocol.

1 Introduction Standardized web service as a main solution of service-oriented computing provides a flexible and convenient way for applications to select and integrate basic services to form new value-added services. Many applications bring service composition into practice, Figure 1 shows a service composition example. In Figure 1, a traveler requests a tourism planning from a service provider and existing atomic services can’t satisfy this composite request by themselves. Service composer can integrate the fight booking, accommodation booking and car rental by using BPEL and execute the BPEL file on an engine such as BEPL4J[6]. Furthermore, the traveler also associates the request with some global QoS constraints (e.g. total price ≤ $5 and response time ≤ 1 minute). The service composer must ensure that the integrated QoS attributes satisfy the global constraints. *

This project supported by Provincial Natural Science Foundation of Hebei Province, China (Grant No. F2009000929).

C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 190–199, 2010. © IFIP International Federation for Information Processing 2010

ServiceStore: A Peer-to-Peer Framework for QoS-Aware Service Composition

191

Fig. 1. Web service composition example

With the aim to realize efficient service composition and resource utilization in distributed environment, our contribution of this paper can be briefly stated as follows: (1) A resource-efficient service selection approach. In order to pick out appropriate component services from the alternative ones that provide identical functionality but distinct QoS attributes and resource states. With the method proposed in [7], we design a simple local selection approach that not only satisfies the global constraints but also provides efficient resource utilization. (2) A multi-role cooperation (MRC) protocol. Each peer in MRC protocol can plays four roles – query originator, query decomposer, task broker and coordinator. With the help of this protocol, a composite service request can be solved efficiently. Assuming that the component service is atomic, the rest of this paper is organized as follows. Section 2 gives a brief overview of related work. Section 3 introduces the system architecture. Our multi-role cooperation protocol for distributed service selection and composition is presented in Section 4. Section 5 gives a simple proactive failure recovery approach. Experimental evaluations are presented in Section 6. Finally, Section 7 gives conclusions and our future work.

2 Related Work The problem of service composition has drawn many research institutes in recent years. As centralized orchestration lacks scalability and is easy to break down, more research work concentrates on decentralized orchestration. P2P system which is famous for its self-organizing and scalability has been adopted by many projects as their basic architecture. SELF-SERV [4,5] adopts an orchestration model based on P2P interactions between service components in the composition which provides greater scalability than the approaches based on central scheduler. They propose the concept of service community and a decentralized execution model. But service discovery and selection are not considered. WSPDS [7] uses an unstructured P2P system – Gnutella [2] as its infrastructure. With probability flooding technique and the subsequent content-based network, the overhead of query dissemination is significantly reduced. In WSPDS, each servent (acts as both server and client) is composed of two engines, communication engine and local query engine, having tasks for communication, collaboration and issue query. To discover a requested service, each servent receives the query and forwards it to the neighbor that has the most similar identity to the query. However, QoS-aware service selection is not considered in WSPDS.

192

J. Jin et al.

SpiderNet [10] proposes a QoS-aware service composition framework that uses DHT based P2P system as its infrastructure. Using (key, value) pairs, service discovery is very efficient. SpiderNet, each peer acts the same as in WSPDS, both server and client. SpiderNet uses a probing protocol to collect needed information and perform parallel searching of multiple candidate service graphs. Although the authors use probing budget and quota to control each request’s probing overhead, they don’t consider the situation when dealing with large number of concurrent requests and the service session setup time can’t be guaranteed. Note that all the research work above treats service discovery and selection sequentially which is costly and unnecessary. Integer programming [8] can be used to find optimal selection of component services [15]. Alrifai et al. [9] adopts it and changes service selection into a parallel fashion. In this paper, we adopt this idea and propose a distributed broker-based framework with MRC protocol to achieve QoS-aware and resource-efficient service composition.

3 System Architecture The ServiceStore system is implemented as a distributed middleware infrastructure, which can effectively map user’s composite service request into a set of component services in the P2P service overlay. 3.1 Three-Layer Architecture The architecture of ServiceStore is a three-layer structure (see Figure 2). The bottom service overlay is constructed by all service nodes with their registry component services and links mapped from underlying physical network. To facilitate node location in service overlay, a unique identifier nodeID is assigned to each service node. Each component service provides its functionality with advertised QoS attributes, however these non-functional values are mutative yet, e.g., the response time will be high when network congestion emerges, we classify them into two parts: (1) static metadata denoted as MDs(sij), a profile of the component service, including function name, its location LC(sij) and IO parameter list; (2) dynamic metadata denoted as MDd(sij), including recent statistical QoS attribute values Q(sij) = [q1(sij),…,qM1(sij)] and instant workload WL(sij) = [r1(sij),…,rM2(sij)], where M1 and M2 are the sizes of QoS vector and workload vector respectively, Q(sij) and WL(sij) are mutative to describe the performance of sij. All these component services can be classified into different service classes with each class sharing the same functionality. From the perspective of delivered functionality, each service class is identified as a service task, denoted as Ti. For clarification, we use Si = {si1,…, sij,…, siL} to denote the service class corresponding to Ti, where sij represents the j-th component service being able to fulfill the service task Ti, and L is the size of Si. The scenario of L > 1 indicates that the service task Ti is able to be realized by multiple candidate services, which can differ in their respective QoS attributes. To fast locate components services, we adopt Distributed Hash Table technique [13] to manage the component services in ServiceStore. DHT systems use (key, value) pairs to store and retrieve the value associated with a given key. We design a hash function to map a function name (keywords of function name) to a nodeID

ServiceStore: A Peer-to-Peer Framework for QoS-Aware Service Composition

193

(GUID). After applying it, the metadata list of component services that own similar function names and thus belong to the same service class (including MDd(Si) and MDs(Si)) are stored on the same service node, here called task broker. We use Broker(Si) to denote the task broker corresponding to Si. In Figure 2, the middle layer shows the task brokers of all component services on the service overlay. The top layer is a function graph that comes from a user’s composite service request. A request is denoted as R = F, Qr , where F = {T1,…,TN}is a function graph composed by a set of tasks (composition relations); Qr = [ q1r ,…, q Mr 1 ] shows a user’s QoS requirements.

〈

〉

Fig. 2. System architecture

3.2 QoS-Aware Service Composition Problem In ServiceStore, given a composite service request R with F = {T1,…,TN} and Qr = [ q1r ,…, q Mr 1 ], the aim of service composition is to find a list of component services, we call it Execution Plan (EP), that can realize each task in F and satisfy each quality attribute in Qr. Figure 2 shows a few of selected components services and their links with dotted lines and circles respectively, which make up an EP for R.

4 Multi-Role Cooperation (MRC) Protocol 4.1 Four Roles in MRC Protocol In MRC protocol, each peer plays four roles: (1) Query originator provides visualized specification environment [1] to help users issue composite service requests. (2) Query decomposer receives the quality levels and converts the global constraints Qr into local constraints and sends them to the involved task brokers. (3) Task broker maintains the meta-data list of alternative component services, and receives local constraints from requesting node and returns the most appropriate component service according to the heartbeats from every component service. (4) Execution coordinator the execution of the selected component services using composition pattern (e.g. sequential, conditional, concurrent, loop) and return the results to the user.

194

J. Jin et al.

Fig. 3. A peer plays four roles in ServiceStore

4.2 MRC Protocol Our MRC protocol includes five major steps shown in Figure 3: Step 1. Issue a composite service request. With the help of prestored composite service templates, a user at a host specifies a composite service request R using GUI. After query analysis, F, Qr is sent to the query decomposer and F is sent to execution coordinator.

〈

〉

Step 2. Decompose global QoS constraints into local ones. Through graph analysis, the involved tasks and global QoS constraints Qr are sent to the MIP decomposition [7]. Then the query decomposer achieves N local constraints and sends them to the corresponding task brokers. Step 3. Select feasible component services locally. To achieve accurate selection, each component service sends heartbeat message with MDd(sij) indicating the states of sij to Broker(Si). Then each corresponding task broker performs local selection and returns the selected service candidates to the requesting peer separately. And for efficient resource utilization, task broker updates WL(sij) of the optimal selected candidate service(e.g. its concurrent connection plus one). The details of this step will be described in the next subsection. Step 4. Form executable plans. Upon the receipt of all service candidates from the corresponding task brokers, the execution coordinator begins to compose them into EP according to F and sends to the first component service in EP. Step 5. Coordinate the execution of the EP. When receives an EP, a component service checks if its function is contained in the EP. If its function exists, the component service begins to execute and output results to the next component service according to EP. Finally, after all component services complete their executions, the

ServiceStore: A Peer-to-Peer Framework for QoS-Aware Service Composition

195

last component service sends the results to the requesting peer (user) and each involved task broker recovers WL(sij) of the component services in EP . 4.3 Service Selection As each task broker keeps the information of all alternative component services, upon the receipt of local constraints, it uses them as the bound and performs service selection for the corresponding service class independently. Given the following parameters: (1) The received local constraints for service class Si: Qc(Si)= [qk(Si) | 1 ≤ k ≤ M1] ; (2)The dynamic metadata of component service sij: MDd(sij) = , where Q(sij) = [qk(sij) |1 ≤ k ≤ M1] and WL (sij) = [rk(sij)| 1 ≤ k ≤ M2] . We compute the utility U(sij) of the j-th service candidate in class Si as M1

U ( sij ) = ∑

Qmax (i, k ) − qk ( sij )

k =1 Qmax (i, k ) − Qmin (i, k )

where

∑

M1 k =1

× ωk

(1)

ω k = 1 , Qmax (i, k ) and Qmin (i, k ) represent the max and min value of the k-

th quality attribute in class Si. Generally, the component service with the highest U(sij) is always selected as it provides the best capability. However, with the increasing number of invocation, its actual performance may become poor. Thus, a resource utility function UR(sij) is needed for representing a component service’s resource utility.

∑

v

M2

ra ij (k )

k =1

ij rmax (k )

UR( s ij ) = ∑

v

×ω k

(2)

v v ω k = 1 , rmax ( k ) and ra (k ) represent the max and available value of the kth resource (e.g. memory) in node vij (suppose sij is on the node vij). Here three requirements need to be considered in service selection: The selected component services (1) satisfy the global constraints; (2) achieve a large resource utility value; (3) achieve the optimal utility. The first requirement is very essential, as our major aim is to achieve a feasible solution for the user. The last requirement is set the lowest priority, even if big value does please the user, it may cause the system unstable yet for resource competition. Hence, for better stability and resource utility, the second requirement has higher priority than the last one. We apply these requirements in the algorithm 1.

where

M2

ij

ij

k =1

Algorithm 1. Local service selection Input: Decomposed local constraints Qc = [q1,q2,…,qM1] and metadata list of the service class S MDd(S) = {<[q1(sj),…, qM1(sj)], [r1(sj), …, rM2(sj)]> | 1< j qk then break

196

J. Jin et al.

5. end if 6. set k = k + 1 7. end for 8. if c then 9. add sj to Sout 10. compute U(sj) and UR(sj) 11. end if 12. end for 13. Sort Sout according to UR(sj) 14. return Sout

Our aim is to get a list of feasible component services from each involved task broker for the completion of EP. All feasible component services must meet the aforementioned three requirements. Algorithm 1 shows the service selection process. With the decomposed QoS constraints for service class S and meta-data of S as input parameters, the Broker(S) begins to run this algorithm. Broker(S) checks every QoS constraint (e.g. price ≤ $2) for every candidate service. If any QoS attribute was beyond the upper bound of the according given QoS constraint, Broker(S) would discard that component service. Therefore, the first requirement is satisfied during service selection. After Sout filled with all feasible component services, we sort it according to UR(sj) of every feasible component service. For fault tolerance, each involved task broker returns more than one candidate services.

5 Proactive Failure Recovery Failure recovery is very essential in dynamic environment [12]. ServiceStore provides a proactive failure recovery mechanism to maintain the quality of service composition during system runtime. As task broker is very crucial in MRC protocol, we discuss the situation when task brokers fail. ServiceStore maintains a small number of backup task brokers for each service class, for fault tolerance, when a peer publishes a component service, the metadata of this component service are stored into more than one task brokers [10]1. (1) Backup task broker computation. Applying the secure hash algorithm to the strings formed by concatenating two or three the component service’s function name, we can achieve different resourceIds, thus different task brokers to maintain the metadata list of this service class. For clarification, we give the following simple functions: String Concatenate(String functionname, int n) : concatenate the function name for n times. GUID SH(String functionname) : apply the secure hash algorithm to the functionname. And these three task brokers of the service class Si with the function name name_i are calculated as the flowing: Broker1(Si) = SH(Concatenate(name_i, 1)) Broker2(Si) = SH(Concatenate(name_i, 2)) Broker3(Si) = SH(Concatenate(name_i, 3)) 1

Current implementation keeps two additional backups.

ServiceStore: A Peer-to-Peer Framework for QoS-Aware Service Composition

197

For example, three task brokers of car rental service class are calculated by SH(Concatenate (“carrental”,1)), SH(Concatenate (“carrental”,2)) and SH(Concatenate (“carrental”,3)) respectively. (2) Backup task broker selection. If one of the task brokers failed, we would adopt the rule depicted in Figure 4. (3) Backup task broker synchronization. As service selection depends on the metadata list of the candidate services, each component service periodically sends the MDd to the task brokers (including the backup ones).

Fig. 4. A simple rule for task broker selection

6 Implementation and Evaluation The experiment is carried out on PeerSim [3] and the decentralized service overlay is implemented based on DHT based P2P system Pastry [11,14]. Please note that we set the same parameters during each round of simulation: 50 service functions in service overlay, 2 composite service requests during each time unit and 3000 time units each round of simulation lasts. First, we evaluate the resource utility of our MRC protocol. For simplicity, we use the number of each candidate component service’s concurrent link to measure the resource efficiency of our selection approach. We use a 1000 nodes service overlay, with each node provides component services whose function is selected from 50 predefined functions and each function has 4 duplicates. Each composite service request contains 3 functions. Every component service’s number of concurrent link will be increased by one when it is selected and reduced by one after working for 80-90 time units. Figure 5 illustrates the 4 duplicates almost have the same number of concurrent connections during 2000 running time units.

Fig. 5. Concurrent connections of each service duplicate

198

J. Jin et al.

Second, we evaluate the max and average hops when discovering a component service in different size service overlay. As the expected number of routing hops in DHT based P2P system Pastry is O(logN) [14], in Figure 6 although the service overlay size grows from 500 to 20000, the number of average hop increased very slowly and only few hops reach the max number.

Fig. 6. Max and average hops for discovering a component service

Third, we measure the average composite service session setup time with different service overlay size and different function number of each request. Figure 7 illustrates the average composite service session setup time when the function number is 3, 5, 10, 20 and the service overlay size varies from 50-20000. Thanks to parallel service selection, with the increasing service overlay size and function number in each request, composite service setup time increases slowly and doesn’t multiples with the function number.

Fig. 7. Composite service session setup time

7 Conclusion and Future Work In this paper, we have presented a P2P service composition framework called ServiceStore. The main contributions are: 1) ServiceStore provides a fully decentralized architecture implemented by using distributed task brokers as coordinators; 2)

ServiceStore: A Peer-to-Peer Framework for QoS-Aware Service Composition

199

ServiceStore provides a simple MRC protocol for service composition; 3) Our evaluation shows that ServiceStore scales well with service overlay size increasing and achieves good resource efficiency. Since the task broker is a critical role in MRC protocol and the failure recovery needs more time especially near the end of composite service execution, in the future we will integrate behavior prediction into our service composition framework which can help us to improve system stability.

References 1. Frolund, S., Koistinen, J.: Quality of service specification in distributed object systems desig. Distributed Systems Engineering Journal 5(4) (December 1998) (QML, which is a language for the description of QoS using XML) 2. Gnutella, http://gnutella.wego.com/ 3. PeerSim, http://peersim.sourceforge.net/ 4. Benatallah, B., Dumas, M., Sheng, Q.Z., Ngu, A.H.: Declarative Composition and Peer-toPeer Provisioning of Dynamic Web Services. In: Proc. Int’l Conf. Data Eng. ICDE, pp. 297–308 (February 2002) 5. Benatallah, B., Sheng, Q., Dumas, M.: The Self-Serv environment for web services composition. IEEE Internet Computing 7(1), 40–48 (2003) 6. Business Process Execution Language for Web Services Java Run Time (BPWS4J), http://www.alphaworks.ibm.com/tech/bpws4j 7. Banaei-Kashani, F., Chen, C.-C., Shahabi, C.: WSPDS: web services peer-to-peer discovery service. In: Proc. of the 5th Int’l Conference on Internet Computing (IC), Las Vegas, Nevada, pp. 733–743 (June 2004) 8. Nemhauser, G.L., Wolsey, L.A.: Integer and Combinatorial Optimization. WileyInterscience, New York (1988) 9. Alrifai, M., Risse, T.: Combining Global Optimization with Local Selection for Efficient QoS-aware Service Composition. In: Proc. of the 18th Int’l World Wide Web (WWW), Madrid, Spain (April 2009) 10. Gu, X., Nahrstedt, K., Yu, B.: SpiderNet: An Integrated Peer-to-Peer Service Composition Framework. In: Proc. of the 13th Int’l Symp. on High-Performance Distributed Computing (HPDC), Honolulu, Hawaii, pp. 110–119. IEEE Computer Society, Los Alamitos (2004) 11. PeerSim-Pastry, http://code.google.com/p/peersim-pastry/ 12. Raman, B., Katz, R.H.: Load Balancing and Stability Issues in Algorithms for Service Composition. In: Proc. of IEEE INFOCOM 2003, San Francisco, CA (April 2003) 13. Distributed hash table, http://en.wikipedia.org/wiki/ Dist-ributed_hash_table 14. Rowstron, A., Druschel, P.: Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In: Proc. of the 18th IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), Heidelberg, Germany (November 2001) 15. Ardagna, D., Pernici, B.: Global and Local QoS Constraints Guarantee in Web Service Selection. In: 3rd IEEE International Conference on Web Services (ICWS), Orlando, FL, USA, pp. 805–806 (July 2005)

Identifying Nearest Neighbor Nodes and Connectivity in Three-Dimensional Wireless Sensor Networks Using Poisson Point Field Yanhuai Qu1 , Jianan Fang1 , and Shuai Zhang2 1 School of Information Science and Technology, Donghua University, Shanghai 201620, P.R. China 2 School of Information and Electronic Engineering, Shandong Institute of Business and Technology, Yantai 264005, P.R. China [email protected], [email protected], [email protected]

Abstract. Some problems about the connectivity of wireless sensor networks (WSNs) are always important and diﬃcult topics in research, especially the trade-oﬀ between connectivity and energy control. In this paper, we present a novel and eﬀective method to calculate nearest neighbor nodes in three-dimensional WSNs using Poisson point ﬁeld theory, which enables each node to ﬁnd the kth nearest neighbor node and adjust transmitting range according to local requirement. Moreover, we derive the probability expression of the k-connected network. Finally, we give corresponding simulation results and discussions about the practical value for node energy control and design of WSNs. Keywords: Wireless sensor networks, nearest neighbor nodes, connectivity, transmitting range, Poisson point ﬁeld.

1

Introduction

Wireless Sensor Networks (WSNs) have received signiﬁcant attention in recent years due to their numerous potential in military and civil applications. Typically, a wireless sensor node consists of sensing, computing, communication, actuation, and power components. Sensor nodes are often air-dropped in hostile or dangerous region. It is not possible for humans to reach these sensor nodes and maintain each sensor node, as often the number of sensor nodes is quite large. Hence, self-organization of sensor nodes to form a connected network is an essential requirement. Additionally, the unattended nature of WSNs destines a majority of sensors to have energy sources which may not be replenished. Though some WSNs have equipped renewable energy such as solar battery, the energy consumed will limit their application. So energy is a decisive resource in WSNs. Energy eﬃciency and network lifetime have occupied a large portion of research eﬀort in WSNs [1],[2],[3],[4],[5]. In this paper, we investigate the problems of nearest neighbor nodes and connectivity of WSNs in three dimensions applying Poisson point ﬁeld theory C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 200–209, 2010. c IFIP International Federation for Information Processing 2010

Identifying Nearest Neighbor Nodes and Connectivity in WSNs

201

and random geometry graphs theory. We assume n nodes (We denote sensor node by node later.) are distributed in [0, l]3 according to a three-dimensional homogeneous Poisson point ﬁeld with density λ (λ = n/l3 ). Firstly, we investigate nearest neighbor nodes of a node and make use of conditional probability to ﬁnd the probability distribution function F (r) of the k th nearest neighbor node distance. And then corresponding critical transmitting range of the node is given. Next, basis of prior obtained results of this paper we discuss the connectivity of WSNs applying the random geometric graphs theorem and derive the probability expression of the k-connected Network and corresponding critical transmitting range of all nodes. Moreover, we present simulation results and discuss the practical value for node power control and design of WSNs. Finally, we present conclusion and outline the directions for our future work.

2

Related Prior Work

In this section, we recall some related work. Hou and Li [6] presented a model for analyzing the performance of transmission strategies in a multihop packet radio network where each station had adjustable transmission radius. One of the transmission strategies, namely transmitting to the nearest forward neighbor by using adjustable transmission power, had desirable features in a high terminal density environment. Philips et al. [7] examined a model of a packet radio network in which transmitters with range R were distributed according to a two-dimensional Poisson point process with density D. To ensure network connectivity, the expected number of nearest neighbors of a transmitter must grow logarithmically with the area of the network. Gupta and Kumar [8] derived the critical power a node in wireless network needed to transmit in order to ensure that the network was connected with probability one as the number of nodes in the network tended to inﬁnity. They shown that if n nodes were distributed in a disc of unit area in and each node transmitted at a power level so as to cover an area of πr2 = (ln(n + c(n))/n), then the resulting network was asymptotically connected with probability one if and only if c(n) → +∞. Bettstetter [9],[10] investigated the minimum node degree and k-connectivity of wireless multihop network in the one (respectively, two) dimensional space, he derived an analytical expression of transmitting range of k-connected network.

3 3.1

Preliminary Poisson Point Field

One-dimensional number distributions of stationary Poisson point ﬁeld N: P rob(N (B) = k) =

(λ|B|)k −λ|B| e k!

(k = 0, 1, ..., N ; |B| < ∞),

(1)

where N (B) denotes the number of points of N in a bounded Borel set B, λ is the density of N, and | · | denotes Lebesgue measure [11],[12],[13],[14].

202

3.2

Y. Qu, J. Fang, and S. Zhang

Some Basic Concepts of Graph Theory

Definition 1 (Graph). A graph G is an ordered pair of disjoint sets (V ,E), where E ⊆ V × V . Set V is called the vertex(or node, or point),while set E is the edge (or line) set of graph G. Definition 2 (Connected Graph). A graph G = (V, E) is connected if for any two nodes u, v ∈ V there exists a path from u to v in G. Definition 3 (k − Connected and k-Edge-Connected Graph). A graph G = (V, E) is k-connected, for some k ≥ 2, if removing any k-1 nodes from the graph does not disconnect it. Similarly, G is k-edge-connected, for some k ≥ 2, if removing any k-1 edges from the graph does not disconnect it [15]. (see Fig.1)

Fig. 1. 2-connected Graph

3.3

Elements of Random Geometric Graphs Theory

Definition 4 (Random Geometric Graphs). In the random geometric graphs G(n,r), n nodes are placed at random with uniform distribution in a square area of unit size (more generally, a m-dimensional cube). There is an edge (u,v) between any pair of nodes u and v, if the Euclidean distance between them is less than r [16].

4 4.1

Network Model Network Topology Model

Consider a WSN. Given n ∈ N+ (N+ = {1, 2, ..., N ),we use V to denote the set of n nodes, i.e. V = Vi :∈ [n]}. Given l ∈ N+ , we use L to denote the set of l communication links between the nodes in V, i.e. L = {Li : i ∈ [l]}. Definition 5 (N etwork T opology Graph). The network topology graph N T G =< V, L > is a graph with vertex set V representing the set of nodes, and edge set L representing the set of communication links. An arbitrary WSN consists of a set of nodes randomly placed in a space. Each node can communicate with another node within distance r through one-hop communication, where r is given as the transmitting range of node (see Fig.2). Throughout the paper, we consider an independent and identically distributed (i.i.d.) network topology model.

Identifying Nearest Neighbor Nodes and Connectivity in WSNs

203

r

Isolated nodes

Fig. 2. The topology of WSNs

4.2

Spatial Node Distribution Model

Assume n nodes are distributed in [0, l]3 according to a three-dimensional homogeneous Poisson point ﬁeld with density λ(λ = n/l3 ).

5

Nearest Neighbor Nodes

For a Poisson point ﬁeld, we can specify how the inter-node distances are distributed. By the stationarity of the Poisson point ﬁeld and Slivnyak’s theorem [11],[12], the distance between the origin and its the k th nearest neighbor is the same as the distance between an arbitrary point and its the k th nearest neighbor. 5.1

The kth Nearest Neighbor Node

In this section, we give the probability distribution function F (r) of the k th nearest neighbor node distance, denoted by Fk (r). According to the homogeneity of Poisson point ﬁeld, we can calculate it for a point close to the origin o. The conditional probability Fk (r) = 1 − P rob(((N (B(o, r) − B(o, ε)) = 0) ∪ (N (B(o, r) − B(o, ε)) = 1)... ∪(N (B(o, r) − B(o, ε)) = k − 1))|N (B(o, ε)) = 1)

(2)

is the probability that the k th neighbor node of a node in a small sphere B(o, ε)(ε > 0) locates at distance not greater than r from the origin o (see Fig.3), where r ε. As ε tends 0, we can derive the probability distribution function Fk (r) of the k th nearest neighbor node distance. According to the deﬁnition of conditional probability, we have Fk (r) = 1 − P rob(((N (B(o, r) − B(o, ε)) = 0) ∪ (N (B(o, r) −B(o, ε)) = 1) ∪ ... ∪ (N (B(o, r) − B(o, ε)) = k − 1)), N (B(o, ε)) = 1) P rob(N (B(o, ε)) = 1).

(3)

204

Y. Qu, J. Fang, and S. Zhang

z The 2nd nearest Neighbor Node

R2

y

ε o

R1

Rk The kth nearest Neighbor Node The 1st nearest Neighbor Node

x

Fig. 3. Nearest neighbor nodes in three dimensions

Because B(o, r) − B(o, ε) and B(o, ε) are disjoint Borel sets, by the fundamental properties of homogeneous Poisson point ﬁeld, we get Fk (r) = 1 − P rob((N (B(o, r) − B(o, ε)) = 0) ∪ (N (B(o, r) − B(o, ε)) = 1)... ∪(N (B(o, r) − B(o, ε)) = k − 1)).

(4)

Make use of (1), we can derive Fk (r) = 1 − Exp(−λ|B(o, r) − B(o, ε)|)(1 + λ|B(o, r) − B(o, ε)| + ... +(λ|B(o, r) − B(o, ε)|)k−1 /(k − 1)!) 4 4 Fk (r) = 1 − Exp(− λπ(r3 − ε3 )) · (1 + λπ(r3 − ε3 ) + ... 3 3 ( 43 λπ(r3 − ε3 ))k−1 + ). (k − 1)!

(5)

(6)

Here we let ε tend 0, then ( 4 λπr3 )k−1 3 4 4 Fk (r) = 1 − e− 3 λπr (1 + λπr3 + ... + 3 ) 3 (k − 1)! 4

= 1 − e− 3 λπr

3

k−1 i=1

( 43 λπr3 )i . i!

(7)

The corresponding probability density function is fk (r) =

3( 34 λπr3 )k − 4 λπr3 e 3 . r(k − 1)!

(8)

Identifying Nearest Neighbor Nodes and Connectivity in WSNs

205

We denote critical transmitting range of the node by Rk−th when its the k th nearest neighbor node got. We can deduce the probability P rob(r ≤ Rk−th )=

Rk−th

fk (r)dr 0

Rk−th

= 0

=1− e

3( 43 λπr3 )k − 4 λπr3 e 3 dr r(k − 1)!

k−1 3 ( 4 λπRk−th )i − 43 λπR3k−th 3

i!

i=1

5.2

.

(9)

The Isolated Node

Specially, when a node has no neighbor node, i.e. it is a isolated node, the probability is

P rob(isolated node) = P rob(N (B(o, r)− B(o, ε)) = 0) = Exp(−λ|B(o, r) − B(o, ε)|) 4 = Exp(− λπ(r3 − ε3 )) 3 − 43 λπr 3 =e (ε → 0), and critical transmitting range Riso of the node is 3 ln P rob(isolated node) 3 Riso = − . 4λπ

6

(10)

(11)

Connectivity

In this section, we investigate the k-connected network. In [17], Penrose has proved P rob(G is k − connected) = P rob(DEGmin (G) ≥ k), (12) where DEGmin (G) denotes the minimum node degree of random geometric graph G, i.e. a network is said to be k-connected if each node has at least the k th nearest neighbor node. So the probability for the k-connected network is 4

3

P rob(k − Connected) ≈ (1 − e− 3 λπRkConn

k−1 i=1

3 ( 43 λπRkConn )i n ) , i!

(13)

where RkConn denotes critical transmitting range of all nodes of k-connected network.

206

7

Y. Qu, J. Fang, and S. Zhang

Simulation and Discussion

In this section, we present simulation and discussion. In our simulation environment, the space of nodes distribution V = [0, 125]3 (a unit of measurement: m3 ), the number of nodes n ∈ [0, 100], so node density λ = n/V . Firstly, Fig.4 shows simulation results of the probability that a node has no neighbor node. For WSNs to function properly in the ﬁrst place it must be connected, or mostly connected. Otherwise the network would consist of scattered isolated islands of information and could not support properly networking applications.

(a)

(b)

Fig. 4. (a) The probability that a node has no neighbor node, (b) Relation between the number of nodes and transmission range when probability that a node has no neighbor node is 99%

Next, Fig.5 shows the simulation results of the probability distribution that a node has the 3rd nearest neighbor node. A main constraint in WSNs is energy. It would contribute to the survival of overall network if each node is capable of ﬁnding nearest neighbor node and adjusting transmitting range according to local requirement. Then, Fig.6 shows simulation results of the probability distribution for 3connected network. The connectivity is a vital attribute of WSNs. Through simulations, we notice that connectivity of network becomes multiple with augmentation of transmitting range rapidly. Finally, the critical transmitting range values reported in Table 1 can be interpreted as transmitting range in diﬀerent operation state of WSNs, where Riso denotes critical transmitting range of isolated node, R1st (respectively, R2nd and R3rd ) denotes critical transmitting range of the node having the 1st (respectively, 2nd and 3rd ) nearest neighbor node, R1Conn (respectively, R2Conn and R3Conn )

Identifying Nearest Neighbor Nodes and Connectivity in WSNs

(a)

207

(b)

Fig. 5. (a) The probability distribution that a node has the 3rd nearest neighbor node, (b) Relation between the number of nodes and transmitting range when probability that a node has the 3rd nearest neighbor node is 99%

(a)

(b)

Fig. 6. (a) The probability distribution of 3-connected network, (b) Relation between the number of nodes and transmitting range when probability of the 3-connected network is 99%

208

Y. Qu, J. Fang, and S. Zhang

denotes critical transmitting range of all nodes of 1-connected (respectively, 2connected and 3-connected) network. We observe that the critical transmitting range values have signiﬁcant diﬀerences between the k th nearest neighbor node and k-connected network. Due to sensor transmitting power in direct proportion to transmitting range, above results can provide a idea for designers of WSNs. Under the conditions of a large connected component, local nodes or few isolated nodes can adjust transmitting range independent of any other node, which can maximally prolong lifetime of network. Table 1. Critical transmitting range (a unit of measurement: m) yielding 99% of diﬀerent operation state of WSNs varying with network capacity n 10 25 50 75 100

8

Riso 7.77 5.72 4.54 3.97 3.61

R1st 59.88 44.12 35.02 30.59 27.79

R2nd 67.64 49.84 39.56 34.56 31.40

R3rd R1Conn R2Conn R3Conn 73.18 68.53 75.50 80.59 53.92 52.64 57.59 61.23 42.80 42.98 46.81 49.64 37.39 38.13 41.43 43.88 33.97 35.01 37.98 40.19

Conclusion and Future Work

In this paper, we propose a novel method to investigate the problems of nearest neighbor nodes and connectivity of WSNs in three dimensions. We ﬁrstly provided probability distribution function of the k th nearest neighbor node distance and analytical expression of critical transmitting range of corresponding node. And then we derive probability expression of k-connected network and critical transmitting range of all nodes. Simulation results and discussions demonstrate that the method is accurate and eﬀective which can provide some theoretical basis for furthering research energy eﬃciency and topology control of WSNs. Our future work will be focus on inhomogeneous WSNs and their stochastic properties. Acknowledgments. This research was supported in part by the National Natural Science Foundation of China under Grant No. 60874113. The authors also thank the reviewers for their valuable comments.

References 1. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: Wireless sensor networks: A survey. Computer Networks 38(2), 393–422 (2002) 2. Chakrabarty, K., Iyengar, S.S.: Scalable Infrastructure for Distributed Sensor Networks. Springer, London (2005) 3. Pottie, G.J., Kaiser, W.J.: Wireless sensor networks. Communications of the ACM 43, 51–58 (2000)

Identifying Nearest Neighbor Nodes and Connectivity in WSNs

209

4. Brooks, R.R., Iyengar, S.S.: Multi-Sensor Fusion: Fundamentals and Applications with Software. Prentice Hall, Englewood Cliﬀs (1998) 5. Iyengar, S.S., Brooks, R.R.: Distributed Sensor Networks. Chapman & Hall/CRC, Boca Raton (2005) 6. Hou, T.C., Li, V.O.K.: Transmission range control in multihop packet radio networks. IEEE Trans. on Communications COM 34, 38–44 (1986) 7. Philips, T.K., Panwar, S.S., Tantawi, A.N.: Connectivity properties of a packet radio network model. IEEE Trans. on Information Theory 35(5), 1044–1047 (1989) 8. Gupta, P., Kumar, P.R.: Critical power for asymptotic connectivity in wireless networks. In: Stochastic Analysis, Control, Optimization and Applications. Birkh¨ auser, Boston (1998) 9. Bettstetter, C.: On the minimum node degree and connectivity of a wireless multihop network. In: Proc. ACM International Symposium on Mobile Ad Hoc Networking and Computing. ACM Press, New York (2002) 10. Bettstetter, C.: On the connectivity of wireless multihop networks with homogeneous and inhomogeneous range assignment. In: Proc. IEEE Vehicular Technology Conference. IEEE Press, Los Alamitos (2002) 11. Stoyan, D., Kendall, W., Mecke, J.: Stochastic Geometry and Its Applications, 2nd edn. John Wiley & Sons, Chichester (1996) 12. Stoyan, D., Stoyan, H.: Fractals, Random Shapes and Point Fields. John Wiley & Sons, Chichester (1994) 13. Daley, D., Jones, D.V.: An Introduction to the Theory of Point Processes, 2nd edn. Elementary Theory and Methods, vol. I. Springer, New York (2003) 14. Daley, D., Jones, D.V.: An Introduction to the Theory of Point Processes, 2nd edn. General Theory and Structure, vol. II. Springer, New York (2008) 15. Bollob´ as, B.: Modern Graph Theory. Springer, New York (1998) 16. Penrose, M.D.: Random Geometric Graphs. Oxford University Press, Oxford (2003) 17. Penrose, M.D.: On k-connectivity for a geometric random graph. Random Structures and Algorithms 15(2), 145–164 (1999)

A Novel Trust Evaluation Model for Mobile P2P Networks Xu Wu Department of Computer Science, Xi’an Institute of Posts and Telecommunications, Xi’an, 710121, China [email protected]

Abstract. Trust is one of key factors which influence the development of mobile P2P networks. However, current trust evaluation models are not applicable to mobile P2P networks properly due to some of its characteristics such as heterogeneous nature of the peers, limited-range as well as unreliability of wireless links. In the paper main factors that influence the trust in mobile P2P networks are identified. Based on the analyzed result, we propose a novel trust evaluation model, which helps the networks to operate normally with high probability. Our model does not employ cryptographic approaches or certification mechanisms, so it is light enough to fit well with mobile P2P networks without great overheads. In addition, it also effectively solves the trust problem when no prior interaction history exists, an issue that has not been addressed in many models. The proposed model is shown to be robust in the presence of attackers through simulation. Keywords: Model; trust; P2P networks; mobility.

1 Introduction A wireless mobile network is a cooperative network where each node requires to collaborate with each other to forward packets from a source to a destination. It is obvious that mobile P2P systems are different from the wired ones, since each object is able to move around and each has a limited radio range. Compared to a fixed peerto-peer system, the mobile network environment is more distributed, with wider participants. Traditional security techniques cannot be applied directly to the mobile P2P networks due to the limitations of the wireless medium, expensive bandwidth, and the limitations of the mobile devices [1]. Therefore, computation-intensive techniques like public-key cryptography are not expected to be used in mobile P2P networks. Such a distinction is also beyond the ability of the conventional key management scheme because we cannot guarantee the secrecy of each peer’s private key, In addition, mobile devices are susceptible to a variety of attacks for example, eavesdropping, denial of services, wormhole, and Sybil attack. Even a few malicious peers can easily spread deceitful data and make the networks be in confusion without great efforts. Therefore, some smart trust management schemes are needed to identify trustworthiness of mobile peers in order to distinguish between malicious peers and innocuous peers, and to strengthen reliable peers and weaken suspicious peers. C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 210–219, 2010. © IFIP International Federation for Information Processing 2010

A Novel Trust Evaluation Model for Mobile P2P Networks

211

However mobile P2P networks pose some unique challenges, many trust evaluation models [2-6] are not applicable to mobile P2P networks properly. In the paper characteristics of mobile P2P networks are discussed, and main factors that influence the trust in mobile P2P networks are analyzed. Based on the analyzed result, we propose a novel trust evaluation model for resilient mobile P2P networks, which helps the networks to operate normally with high probability. The rest of the paper is organized as follows. Section 2 describes related work. Section 3 presents the proposed trust model. Section 4 contains experimental study. Finally, we conclude this paper in Section 5.

2 Related Work EigenTrust [2] model is designed for the reputation management of P2P systems. The global reputation of peer i is marked by the local trust values assigned to peer i by other peers, which reflects the experience of other peers with it. The core of the model is that a special normalization process where the trust rating held by a peer is normalized to have their sum equal to 1. The shortcoming is that the normalization could cause the loss of important trust information. Runfang Zhou and Kai Hwang [3] proposed a power-law distribution in user feedbacks and a computational model, i.e., PowerTrust, to leverage the power-law feedback characteristics. The paper used a trust overlay network (TON) to model the trust relationships among peers. PowerTrust can greatly improve global reputation accuracy and aggregation speed, but it can not avoid the communication overhead in global trust computation. A new trust model based on recommendation evidence is proposed for P2P Networks by Tian Chun Qi et al [4]. The proposed model has advantages in modeling dynamic trust relationship and aggregating recommendation information. It filters out noisy recommendation information. Thomas Repantis and Vana Kalogeraki [5] propose a decentralized trust management middleware for ad-hoc, peer-to-peer networks, based on reputation. In the work, the middleware’s protocols take advantage of the unstructured nature of the network to render malicious behavior, and the reputation information of each peer is stored in its neighbors and piggy-backed on its replies. Recently, there are many approaches studying trust management of wireless networks. The significant efforts done so far are to manage trust with the help of Certificate Authority (CA) or Key Distribution Center (KDC). A CA/KDC is responsible for setting up the foremost trust relationships among all the nodes by distributing keys or certificates [6]. However, this strategy suffers from difficulty on collecting t certificates efficiently. In the distributed CA scheme [7], Kong et al. mentioned that the trust between a to-be-member node and t member nodes in its neighborhood can be established by out-of-bound physical proofs, such as human perception or biometrics. However, we can find that this method is far from practical.

3 Trust Evaluation Model Our trust model has two types of trust: direct trust and recommendation trust. Direct trust is the trust of a peer on another based on the direct interacting experience and is used to evaluate trustworthiness when a peer has enough interacting experience with

212

X. Wu

another peer. On the other hand, recommendation trust is used when a peer has little interacting experience with another one. Recommendation trust is the trust of a peer on another one based on direct trust and other peers’ recommendation. In the section we firstly introduce five trust factors which influence the trust in such a mobile environment. We then present the details about how to evaluate the trustworthiness of peers by these trust factors. 3.1 The Trust Factors Communication of P2P application: This factor contains communication ratio information. When a peer finds a certain event, if its neighbor peers also find the same event and broadcast the results of event, communication ratio values for those neighbor peers go up. If they do not communicate, communication ratio values for those peers go down. This factor represents the level of selfishness and normality of a peer. If a peer does not participate in communication in the networks continuously for its battery saving or some other Roubles, its trust value will be degraded.

Vi =

vsi − vf i vsi + vf

(1)

Vi : communication value of peer i, where 1 ≤ i ≤ k .

vsi : communicating success count of node i vf i : communicating failure count of node i

Communicating results: This factor represents the result information of finding malicious events. This factor consists of communicating data and communicating time for the events. The information of this factor is used to check a consistency of each mobile peer and to find malicious peers in the networks. The inconsistency check result affects the value of consistency factor, Ci . When peer j checks the inconsistency of peer i’s communicating results, if the results are out of relatively standard bound of node j , node j estimates the results to be inconsistent or deceitful data. Such an estimation for peer i, affects the value of the consistency factor, Ci .

Ri =< cdi , cti >

(2)

Ri : communicating result value of peer i, where 1 ≤ i ≤ k

cd i : communicating data of peer i cti : communicating time of peer i Consistency: This factor represents a level of consistency of a peer. Based on this factor, we can identify malicious peers, and filter out their communicating data in the networks. Ci =

cci − ici cci + ici

(3)

A Novel Trust Evaluation Model for Mobile P2P Networks

213

Ci : Consistency value of peer i, where 1 ≤ i ≤ k cci : consistent communicating count of peer i

ici : inconsistent communicating count of peer i Power: This factor represents remained lifetime of a mobile peer. As we compute trust values in consideration of this factor, we can reduce additional processes which would be necessary to handle some power- managing policies. In addition, some peers which have high trust values are likely to process more jobs than the other peers which have low trust values. In that case, the higher trust value a peer has, the earlier the peer meets its end. According to the adoption of this power factor, we can prevent such a biased power exhaustion. −1 ≤ P i ≤ 1

(4)

P i : Power value of peer i, where 1 ≤ i ≤ k Size of interactions: Size has different meanings in different mobile P2P environments. For example, in a mobile P2P file sharing network, the size of interaction expresses the file size shared in each interaction. Size of interactions is an important factor that should be considered in the trust model. For peers without any interacting history, most previous trust models often define a default level of trust. But if it is set too low, it would make it more difficult for a peer to show trustworthiness through its actions. If it is set very high, there may be a need to limit the possibility for peers to start over by re-registration after misbehaving. In our trust model, the introduction of the size of interactions effectively solves the trust problem of peers without any interacting history.

“

”

3.2 The Computational Model

Consider the situation where peer i wants to interact with peer j in order to accomplish a certain task. There are two ways in which to calculate trust value: direct and recommendation. Direct trust is denoted as D(Ti ( j ), S ) , where Ti ( j ) is the direct trust value that peer i calculates for peer j. S expresses peer j’s level of size of interaction which is granted by peer i. The level of size of interaction satisfies the following rules. (1) The lowest level is given to a new peer that doesn’t have any interaction history. (2) A certain level is updated if the number of successful interactions reaches the predefined number in the level. The predefined number is decided by the peer itself. The lower the current level is, the more the number of successful interactions it needs. (3) The predefined successful interaction number in a certain level is increased if interactions fail due to malicious activities. Direct trust computation involves an assignment of weights to the trust factors. We define Wi as a weight which represents importance of a particular factor from 0, unimportant, to +1, most important. The weight is dynamic and dependent on the application. If Pj ≠ −1 , Direct trust value that peer i calculates for peer j is computed by the following equation:

214

X. Wu

Ti ( j ) =

W1C j + W2V j + W3 Pj

∑ j =1W j 3

+ pen( m)

1 1 + e− n

(5)

where 0 < W j ≤ 1 . In case of Pj = −1 , we just assign -1 to Ti ( j ) and exclude the node from the networks because it totally cannot work in the networks. Because each mobile peer uses histograms for the accumulative trust evaluation, some malicious or compromised peers that broadcast inconsistent or deceitful data continuously can be found and classified in this trust computing process. pen( m) denotes the punishment function.

1 is the acceleration factor, where 1 + e− n

n denotes the number of fail. It can

make trust value to drop fast when the interaction is failed. When two peers have little interaction experience, other peers’ recommendation is needed for trust establishment. Recommendation trust is the trust of a peer on another one based on direct trust and other peers’ recommendation. Let we assume that peer j requests an interaction with peer i and the size of the interaction is Q . First, peer i computes peer j’s direct trust denoted as D(Ti ( j ), S ) . (1) If Q ≤ S and Ti ( j ) reaches a certain value (which is set by peer i), peer i considers peer j to be trustworthy. It will then decide to interact with peer j. (2) If Q ≤ S but Ti ( j ) fails to reach a certain value, peer i chooses to join a group based on its interest. Then it checks its own group and location with GPS and floods a HELLO message which containing a packet to announce itself to other peers by using Echo protocol [8], then requests requests all other members of the group to cast a vote for peer j from the perspective of trust in the level of Q . For any new peer without any interaction history, its trust value would be 0 and would be granted the lowest level of the size of interaction. Without requesting, it will be permitted to interact at the lowest level. (3) If Q ≤ S but Ti ( j ) fails to reach a certain value, peer i immediately refuses to interact with peer j. (4) If Q ≤ S and Ti ( j ) reaches a very high value, peer i chooses to join a group based on its interest and then requests requests all other members of the group to cast a vote for peer j from the perspective of trust in the level of Q . Second, after the other peers receive the poll request message, they will decide whether to cast the vote based on the following formula. Let e denotes a voting peer, then DTe ( j ) =

N ( j)

∑( m =1

W1C j + W2V j + W3 Pj

∑

3

Wj

j =1

+ pen( m)

1 ) 1 + e−n

(6)

where DTe ( j ) is the poll value of e in j. N ( j ) denotes the total number of interactions e has conducted with j at level Q . Lastly, peer i gathers up all poll information of peer j from the repliers and gets peer j’s recommendation trust by this equation:

A Novel Trust Evaluation Model for Mobile P2P Networks

215

N ( w)

T=

∑ R(w) × p

(7)

i =1

N (w)

where N ( w) denotes the total number of votes and R ( w) denotes peer w ’s vote accuracy factor which is in the range of (0, 1). p is related to DTw ( j ) such that if DTw ( j ) > 0 , p = 1 , else p = 0 .

4 Experimental Study Experiments have been carried out to study the effectiveness and the benefits of our proposed model. In a real environment, there may exist some vicious attacks including malicious recommendations or cheating in the accumulation of trust in small-size interactions. In addition, it should solve the trust problem when there is no interaction history or little trust value. Table 1. Default parameters in simulation experiments

Number of Peers

300

Communicating range (m)

70

2

×

Simulation area (m )

500 500

Number of malicious Peers

0% - 70% of all peers

Risk attitude

averse, neutral, seeking

Communication protocol

802.11

Life time (s)

[50,100]

Maximum speed (m/s)

20

The simulation environment is set up as follows: we create 300 peers that will perform interacting in a mobile p2p resource sharing system. 300 mobile peers are uniformly distributed at the area whose size is 500m × 500m . Communicating range of a mobile device is 70m. The simulated experiments were run on a dual-processor Dell server and the operation system installed on this machine is Linux with kernel 2.6.9. To make our simulation as close to the real mobile p2p systems where peers often go offline, we simulate the offline peers by assigning every peer a random lifetime (or Time-To-Live) within the step range [50, 100]. After reaching the lifetime, the peer will not respond to any service request, and won’t be counted in the statistics either. After one more step, the peer comes alive again with a new life time randomly chosen from the range [50, 100]. In this analysis, we assume that all mobile peers have a same amount of battery power and participate in communication positively regardless of their roles. Each peer acts as both client and server to share its resources with other peers, and communicates with each other via IEEE 802.11. The default parameters in simulation experiments are showed in the table 1.

216

X. Wu 1

The recommendation trust

0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

3

5

7

9

11

13

15

17

19

21

19

21

The wrong recommendation number

(a) 1

The recommendation trust

0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

3

5

7

9

11

13

15

17

The wrong recommendation number

(b) Fig. 1. Trust evaluation results

In the first experiment we evaluate the trust evaluation model in terms of its efficiency of excluding malicious recommendations in the network. We implement and simulate a file sharing system. The environments of the system are as follows. 300 mobile peers are uniformly distributed at the area whose size is 500m × 500m . Communicating range of a mobile device is 70m. In this analysis, we assume that all mobile peers have a same amount of battery power and participate in communication positively regardless of their roles. So, we consider only a consistency evaluation

A Novel Trust Evaluation Model for Mobile P2P Networks

217

factor. Fig. 1 shows the simulation result in which the broken line denotes the recommendation trust value Tm that includes malicious peers’ recommendations and the solid line denotes the real recommendation trust value Tr that doesn’t include any malicious recommendations. In this simulation, a same malicious recommendation event occurs every 10 seconds. As we can see Fig. 1 (a), normal recommendation trust value is 0.3, but a malicious recommendation result is 0.9 by few malicious peer which broadcasts three times as high as a normal recommendation result. This indicates the vulnerability of a system without a trust evaluation scheme. Fig. 1(b) shows the process of filtering inconsistent data of a malicious node which acts inconsistently after certain seconds with a proposed trust evaluation scheme. We can see that Tm fluctuates around Tr but the scale of the fluctuation is very small. The earlier the system detects a malicious node, the lower the malicious recommendations of it can affect the aggregated result.

Fig. 2. Simulation results of peers under independent cheat and group cheat

In the second experiment, we assess the performance of our mechanism under two attack models: independent cheat and group cheat. Our experiment also points out that the trust model is also sensitive to the group cheat. In the experiment, we add a number of malicious peers to the network such that malicious peers make up between 0% and 70% of all peers in the network. Fig. 2.shows what is happening. In this figure, we compare the independent cheap and group cheat. Under independent cheat, the malicious peers firstly accumulate trust values through small interactions, gaining a relatively high trust. After trusted by most adjacent peers, the peer takes advantage of its high trust value to attack another peer, which means to always provide an inauthentic file to another peer when selected as download source. Group cheat is that

218

X. Wu

there is a group in which the peer of the group provides an authentic file to each other and provides an inauthentic file to the peer outside the group. The rate of inauthentic downloads under independent cheat or group cheat increases at the beginning, then starts to drop when the number of malicious peers reaches to 30%-40% of all peers in the network. The reason is that the trust computing mechanism used in our experiments punishes this behavior by lower the trust values quickly. Since malicious peers found by the mechanism will lose choice selected as download sources. As a result, the rate of inauthentic downloads will drop. However, due to the good rating coming from the cheating group, the rate of inauthentic downloads under group cheat drops more slowly than the one under independent peer. Yet one thing remains assured: the rate under group cheat is still dropping and will drop to 5%. Even if no malicious peers are present in the system, downloads are evaluated as inauthentic in 3%-5% of all cases – this accounts for mistakes users make when creating and sharing a file, e.g., by providing the wrong meta-data or creating and sharing an unreadable file.

5 Conclusion and Future Work The realization of trust mechanism in mobile p2p networks is quite different due to some characteristics of mobile environment, which indicates the trust between participants can not be set up simply on the traditional trust mechanism. In the paper we proposed a novel trust evaluation model for mobile P2P networks. The main factors that influence the trust in mobile P2P networks are identified. Our model does not employ cryptographic approaches or certification mechanisms, so it is light enough to fit well with mobile P2P networks without great overheads. To the best of our knowledge, our approach is one of the incipient researches on trust evaluation model for mobile P2P networks that can detect malicious and compromised mobile peers. In addition, the proposed model effectively solves the trust problem of peers without any interacting history. We expect that our trust evaluation model can help to make resilient mobile P2P networks. In the near future, we would like to test our trust into more real mobile P2P systems and analyze the system performances.

References 1. Takeshita, K., Sasabe, M., Nakano, H.: Mobile P2P Networks for Highly Dynamic Environments. In: 6th IEEE International Conference on Pervasive Computing and Communications, Hong Kong, pp. 453–457 (2008) 2. Kamvar, S.D., Schlosser, M.T., Molina, H.G.: The EigenTrust Algorithm for Reputation Management in P2P Networks. In: 12th International Conference on Word Wide Web, Budapest, Bulgaria, pp. 640–651 (2003) 3. Zhou, R., Hwang, K.: PowerTrust: A Robust and Scalable Reputation System for Trusted P2P Computing. IEEE Transactions on Parallel and Distributed Systems 18(5) (2007) 4. Tian, C.Q., Zou, S.H., Wang, W.D., Cheng, S.D.: A New Trust Model Based on Recommendation Evidence for P2P Networks. Chinese Journal of Computers 31(2), 271–281 (2008)

A Novel Trust Evaluation Model for Mobile P2P Networks

219

5. Thomas, R., Vana, K.: Decentralized trust management for ad-hoc peer-to-peer networks. In: 4th International Workshop on Middleware for Pervasive and Ad-Hoc Computing, Melbourne, Australia (2006) 6. Zhou, L., Haas, Z.J.: Securing ad hoc networks. IEEE Special Issue on Network Security 13(6), 24–30 (1999) 7. Kong, J., Zerfos, P., Luo, H., Lu, S., Zhang, L.: Providing robust and ubiquitous security support for mobile ad-hoc networks. In: 9th International Conference on Network Protocol, pp. 25–260 (2001) 8. Sastry, N., Shankar, U., Wagner, D.: Secure verification of Locotion Cloims. In: 2nd ACM Workshop on Wireless Security, New York, pp. 1–10 (2003)

Evaluating and Optimizing I/O Virtualization in Kernel-based Virtual Machine (KVM) Binbin Zhang1, Xiaolin Wang1, Rongfeng Lai1, Liang Yang1, Zhenlin Wang2, Yingwei Luo1, and Xiaoming Li1 1

Dept. of Computer Science and Technology, Peking University, Beijing, China, 100871 2 Dept. of Computer Science, Michigan Technological University, Houghton, USA {wxl,lyw}@pku.edu.cn, [email protected]

Abstract. I/O virtualization performance is an important problem in KVM. In this paper, we evaluate KVM I/O performance and propose several optimizations for improvement. First, we reduce VM Exits by merging successive I/O instructions and decreasing the frequency of timer interrupt. Second, we simplify the Guest OS by removing redundant operations when the guest OS operates in a virtual environment. We eliminate the operations that are useless in the virtual environment and bypass the I/O scheduling in the Guest OS whose results will be rescheduled in the Host OS. We also change NIC driver’s configuration in Guest OS to adapt the virtual environment for better performance. Keywords: Virtualization, KVM, I/O Virtualization, Optimization.

1 Introduction Software emulation is used as the key technique in I/O device virtualization in Kernel-based Virtual Machine (KVM). KVM uses a kernel module to intercept I/O requests from a Guest OS, and passes them to QEMU, an emulator running on the user space of Host OS. QEMU translates these requests into system calls to the Host OS, which will access the physical devices via device drivers. This implementation of VMM is simple, but the performance is usually not satisfactory because multiple environments are involved in each I/O operation that results in multiple context switches and long scheduling latency. In recent versions, KVM tries to reduce the I/O virtualization overhead by emulating key devices in the KVM kernel module. However, the main I/O devices are still emulated by QEMU. In this paper, we evaluate KVM disk and network virtualization overhead and try to optimize it by reducing the overhead of VM Exits1 and simplifying the corresponding virtualization operations in Guest OS. The rest of the paper is organized as follows. Section 2 evaluates disk and network performance in a KVM guest. Section 3 presents our optimizations and evaluations. Section 4 discusses related work. And we conclude the paper in Section 5. 1

A VM Exit is a context switch from Guest OS to VMM. When a sensitive instruction is executed in the Guest OS, a VM Exit will happen. After the VMM emulates the sensitive instruction, the Guest OS can continue to work.

C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 220–231, 2010. © IFIP International Federation for Information Processing 2010

Evaluating and Optimizing I/O Virtualization in KVM

221

2 KVM I/O Performance Evaluation A software emulation-based I/O virtualization technique, which is used by KVM and most other host-based VMMs, causes a significant performance overhead. To analyze the sources of the overhead, we begin with a serial of experimental evaluations of KVM's disk and network I/O virtualization performance. We use bonnie++ [6] to evaluate the performance of disk I/O operations, such as character/block read and write, random/sequential read and write. And we use netperf [7] to evaluate network I/O performance, including data throughput, latency, and CPU utilization rate during sending and receiving data using TCP and UDP protocols. Finally, we use SPECjbb [8] to emulate data warehouse’s I/O workload. 2.1 Disk I/O Performance Test-bed: Intel Core 2 Quad Q9550, 2.83GHz, 4G RAM, 500G SATA disk, Linux 2.6.27.7, KVM-76 with default configuration, and the virtual machine uses raw disk image. Results: We run Bonnie++ which evaluates the file system calls in the Guest OS. Figure 1 illustrates the throughput of the sequential block write benchmarking, as well as the CPU utilization rate during the experiments, of the virtual machine and the native machine. Sequential Block Writes 80000

70

70000

60

s)/ 60000 bK 50000 (t up 40000 hg uo 30000 hrT 20000

50 40 30 20 10

10000 0

)% (n oi ta izl it u UP C

256B 512B

1K

2K

4K

8K

16K 32K

64K 128K 256K 512K 1M

0

Block size Throughput(Native) CPU utilization(Native)

Throughput(KVM) CPU utilization(KVM)

Fig. 1. Comparison of the performance between Native and KVM - Sequential block writes

From Figure 1, it can be observed that the throughput and CPU utilization of sequential block writes on the virtual machine is only about a half compared to the native machine. Figure 2 illustrates the throughput of the sequential block read benchmarking, as well as the CPU utilization rate during the experiments, of the virtual machine and the Native machine.

222

B. Zhang et al.

Sequential Block Reads 100000

70

90000

) s / b K ( t u p h g u o r h T

60

80000

50

70000 60000

40

50000 30

40000 30000

20

20000

10

10000 0

) % ( n o i t a z i l i t u U P C

0 256B 512B 1K

2K

4K

Throughput(Native) CPU utilization(Native)

8K 16K 32K 64K 128K 256K 512K 1M Block size Throughput(KVM) CPU utilization(KVM)

Fig. 2. Comparison of the performance between Native and KVM – Sequential block reads

It can be observed that the throughput of the sequential block read of the virtual machine is very close to that of the Native machine. However, when the size of the disk block is less than 4K, the CPU utilization of the virtual machine is much higher than that of the Native. The reason is that when bonnie++ is reading disk, the data can be frequently hit in the disk cache instead of actually accessing the actual disk during such experiments. From the experimental data shown in Fig.1 and Fig. 2, it can be deduced that the key to achieve better performance of disk I/O is to improve write throughput and reduce the CPU overhead during disk-reading. 2.2 Network Performance Test-bed: Two physical machines, HostA and HostB, both using Gigabit Ethernet. HostA works with 82566DC NIC and HostB works with 82567LM-2 NIC. One virtual machine (named GuestB) runs on HostB with KVM-76. Ping-pong benchmarking: The Round Trip Time (RTT) between HostA and HostB, between HostB and GuestB and that between HostA and GuestB are measured and illustrated in Figure 3. Based on Figure 3, we are able to estimate the additional network overhead brought by virtualization. Packet transmission time from HostA to GuestB is 0.378 ms, which is equivalent to the transmission time form HostA to HostB (0.234 ms) plus the transmission time from HostB to GuestB (0.126 ms). That is true when the transmission reverses, from GuestB to HostA. That means transmission time between HostB and GuestB can be considered as the virtualization overhead. More accurately, this time overhead accounts for 33% of the transmission time between HostA and GuestB.

Evaluating and Optimizing I/O Virtualization in KVM

223

Ping-pong Benchmarking 0.5 0.45 0.4 0.35 ) 0.3 s m ( T 0.25 T R 0.2 0.15 0.1 0.05 0 HostA ping HostB ping HostA ping HostB ping GuestB GuestB HostB GuestB GuestB HostA ping HostB ping HostA

RTT

Fig. 3. Ping-pong benchmarking results

Netperf Tests: The throughput and latency between the client and the server is measured using Netperf, and illustrated in Figure 4. Experiments: 1. Run the netperf server on remote HostA. 2. HostB tests using netperf client. 3. GuestB tests using netperf client, with various configurations, e.g. different virtual NIC and different drivers. Figure 4 shows the result. Netperf 1.2 1 0.8 0.6 0.4 0.2 0

TCP Stream(Mb/s) UDP Stream(Mb/s) TCP RR(Trans/s)

Native(e1000e)

KVM(e1000)

KVM(e1000+NAPI)

Udp RR(Trans/s)

KVM(rtl8139)

Fig. 4. Netperf benchmark test, Guest OS used different NIC configurations - virtual e1000 NIC (e1000 driver, e1000 driver with NAPI support), virtual rtl8139 NIC

According to the conclusion in Experiment 1 based on ping-pong benchmarking, the throughput on the virtual machine should be 2/3 of that of the Native. In other words, the throughput should be approximately 600Mb/s, not at the level of 400Mb/s as measured. This gap indicates that data processing in the VMM and the Host OS may cause an additional overhead.

224

B. Zhang et al.

From Figure 4, it can be also observed that the virtual device capability greatly influences network performance in a virtual environment. 2.3 Instruction Level Evaluation To further analyze the reasons of such performance degradation, we intercept the operations issued by Guest OS that result in VM Exits during the SPECjbb and Netperf tests. The top 5 functions, named the hot functions which result in most VM Exits, are listed in Table 1 and Table 2. It can be observed that I/O operations have a certain hot code effect, that is, a small number of instructions cause lots of VM Exits. So we can optimize these hot functions to decrease the virtualization overheads. Table 1. Hot functions during SPECjbb test on KVM VM

1 2 3 4 5

Trap address 0xc041be3f 0xc042ffa6 0xc0407cf9 0xc041add2 0xc0407cf7

Trap count 607251 602546 600625 600223 600092

Trap time(s)/percentage(%) 4.29 / 17.66 0.10 / 0.41 0.38 / 1.55 8.57 / 35.28 3.08 / 12.68

Trap function ack_ioapic_irq __do_softirq timer_interrupt smp_apic_timer_interrupt timer_interrupt

Table 2. Hot functions during Netperf test on KVM VM

1 2 3 4 5

Trap address 0xc04ee0fc 0xc04ee0d1 0xc04ee153 0xc04edfe4 0xc0623d05

Trap count 16545487 14317411 7636364 6045896 4401573

Trap time(s)/percentage(%) 101.64 / 26.03 85.33 / 21.85 62.26 / 15.94 44.53 / 11.40 0.73 / 0.19

Trap function ioread16 iowrite16 ioread32 ioread8 _spin_lock_irqrestore

Additionally, in the SPECjbb test, the timer interrupt routine is one of the top 5 time consumers. Therefore, optimization on timer interrupts will improve SPECjbb performance. We also intercept hot functions in the Bonnie++ benchmark, and the top 5 hot functions are shown in Table 3. Because these functions are frequently called here and there, we further intercepted the caller functions. Table 3. Hot functions during Bonnie++ test on KVM VM

1 2 3 4

Trap address 0xc06442dd 0xc0547047 0xc0546f3c 0xc05470ff

Trap function acpi_pm_read iowrite8 ioread8 iowrite32

Caller function address 0xc06442e7 0xc05e8fe3 0xc05eaa44 0xc05ea9c2

Caller function verify_pmtmr_rate ata_sff_dev_select ata_bmdma_status ata_bmdma_setup

We observe that the top caller function verify_pmtmr_rate is used to read the clock on motherboard. It is redundant in the virtual environment and thus can be eliminated.

Evaluating and Optimizing I/O Virtualization in KVM

225

3 Optimization and Evaluation Based on the experiment results in Section 2, we focus on reducing context switches and simplifying the Guest OS to optimize KVM I/O virtualization. We discuss our optimization methods in this section. 3.1 Reducing the Context Switching There are multiple occasions for context switching at each I/O operation in KVM, including the context switching among the Guest OS, KVM, and the Host OS. Most switches are caused by the KVM architecture, but the switches between the Guest OS and KVM depend on the behavior of the Guest OS, which may lead to other context switches. If we can modify Guest OS’ behavior and remove some VM Exits, the context switches will be reduced. From the instruction test results in Section 2.3, it can be observed that I/O operations in Guest OS are clustered to a certain extent. Some I/O instructions bring a lot of VM Exits, and it can be further optimized. 3.1.1 Merging Successive I/O Instructions When an I/O instruction is executed in a Guest OS, it will cause a VM exit to KVM. In the Guest OS disk driver, some code fragments include successive I/O instructions, which will cause multiple consecutive VM Exits. If we merge these instructions into a single operation, only a single VM Exit is needed to handle the multiple Guest I/O instructions. We merge the successive I/O instructions to a single vmcall, which will exit to KVM actively. The method is to put information of each instruction into a queue, including IN/OUT, port number, and the value. The address and the length of the queue are passed to KVM as the parameters of vmcall. KVM will get information of each instruction from the queue, and emulate them one by one. For example, Figure 5 is a code fragment in a function (__ide_do_rw_disk) in the disk driver code (/driver/ide/ide-disk.c). /driver/ide/ide-disk.c hwif->OUTB(tasklets[1], IDE_FEATURE_REG); hwif->OUTB(tasklets[3], IDE_NSECTOR_REG); hwif->OUTB(tasklets[7], IDE_SECTOR_REG); hwif->OUTB(tasklets[8], IDE_LCYL_REG); hwif->OUTB(tasklets[9], IDE_HCYL_REG); hwif->OUTB(tasklets[0], IDE_FEATURE_REG); hwif->OUTB(tasklets[2], IDE_NSECTOR_REG); hwif->OUTB(tasklets[4], IDE_SECTOR_REG); hwif->OUTB(tasklets[5], IDE_LCYL_REG); hwif->OUTB(tasklets[6], IDE_HCYL_REG); hwif->OUTB(0x00|drive->select.all,IDE_SELECT_REG);

Fig. 5. A code fragment including successive I/O instructions

226

B. Zhang et al.

11 I/O instructions will be executed concecutively which yield 11 VM Exits. Our approach merges them into one vmcall as the following. The fragment after replacement is illustrated in Figure 6. /driver/ide/ide-disk.c struct io_insn io_out[11]; // the queue to restore instruction information usigned long io_gpa, io_len; #define IO_OUT(x, _type, _val, _port) \ io_out[x].type = _type, \ io_out[x].port = _port, \ io_out[x].val = _val //put information of I/O instructions into the queue: IO_OUT(0, OUTB, tasklets[1], IDE_FEATURE_REG); IO_OUT(1, OUTB, tasklets[3], IDE_NSECTOR_REG); IO_OUT(2, OUTB, tasklets[7], IDE_SECTOR_REG); IO_OUT(3, OUTB, tasklets[8], IDE_LCYL_REG); IO_OUT(4, OUTB, tasklets[9], IDE_HCYL_REG); IO_OUT(5, OUTB, tasklets[0], IDE_FEATURE_REG); IO_OUT(6, OUTB, tasklets[2], IDE_NSECTOR_REG); IO_OUT(7, OUTB, tasklets[4], IDE_SECTOR_REG); IO_OUT(8, OUTB, tasklets[5], IDE_LCYL_REG); IO_OUT(9, OUTB, tasklets[6], IDE_HCYL_REG); IO_OUT(10, OUTB, 0x00|drive->select.all, IDE_SELECT_REG); //because the address space is different between KVM and Guest OS, we should translate io_insn address into physical address: io_gpa = virt_to_phys((unsigned long)io_out); io_len = 11; // the number of I/O instructions we have replaced vmcall(XKVM_IO_COALESCE, io_gpa, io_len, 0); // vmcall to exit into KVM, the parameters include the address and the length of io_insn.

Fig. 6. The code fragment - The successive I/O instructions are replaced into a vmcall

We have modified two fragments which include successive port I/O instructions. And the number of VM Exits caused by I/O instructions is reduced. We count the number of VM Exits caused by inb and outb, and the result is in Table 4: Table 4. The number of VM Exits caused by inb and outb Instruction address 0xc054fcc0 0xc054ecf0

Before modification 111050 15864

After modification 17226 2343

This method is actually borrowed from para-virtualization[5], which reduces context switches between the Guest OS and the VMM via modifying the Guest OS. We now only implement static modification. In the future we will try to implement dynamic

Evaluating and Optimizing I/O Virtualization in KVM

227

modification, which is to replace the code fragments on the fly when the Guest OS is running. One method is to monitor Guest OS execution, locate the fragment which present lots of VM Exits caused by I/O operations, merge the successive I/O instructions and produce the new fragment, then insert jump instructions to replace the old fragment. Another method is to prefetch following instructions after a VM Exit occurs. If there are I/O instructions following the current one, emulate them all at once, and then switch back to Guest OS. We evaluate the performance of our optimization. Unfortunately, the result is not promising. The CPU overhead is somewhat reduced, but I/O throughput is nearly the same. This may be because we only reduce the overhead brought by context switches, but the operation after VM Exit is not modified. Therefore the cost of VM Exit and VM Entry is not the main reason for the performance overhead. 3.1.2 Reducing the Timer Interrupts Timer interrupt is another cause of VM Exits. When a timer interrupt happens, a VM Exit will occur. For non-real-time applications, we can reduce the timer frequency to reduce VM Exits. KVM emulates a PIT (Programmable Interval Timer) for each Guest. The PIT can trigger timer interrupts at a programmed frequency. A PIT consists of an oscillator which produces clock signal at the frequency of (roughly) 1193182HZ. When a clock signal generated, the counter in PIT channel 0 is decreased by 1. When the counter reaches 0, PIT will generate a timer interrupt. The virtual PIT works in a similar way. So if we modify the initial value of the counter in PIT channel 0, the timer frequency is modified. For example, if the counter in PIT channel 0 is initialized 1193, the timer frequency is 1000HZ (1193182/1193). And if the counter is initialized 11932, the timer frequency is 100HZ. If the value initialized by the Guest OS is modified, the timer frequency is modified transparently to Guest OS. The implementation is simple: We only need to modify pit_load_count function. We decrease the Guest actual timer frequency to 100HZ, and we compared its I/O performance to the Guest when the timer frequency is 1000HZ. SPECjbb

Bonnie++ 1.2

1.04

1.15

1.02 1

1.1

0.98

1.05

0.96 1

0.94

0.95

0.92

0.9 vHZ1000

putc

write

rewrite

getc

s create

vHZ100

(a) Bonnie++

r create

0.9 vHZ1000 1 vHZ100

2

3

4

5 Warehouses

6

7

8

average

(b) SPECjbb

Fig. 7. Timer Interrupt Evaluation

It can be observed in Figure 7 that the lower timer frequency results up to 15% performance improvement.

228

B. Zhang et al.

3.2 Simplifying Guest OS A virtual environment is different from a native one. There may be some Guest OS operations that become redundant when the OS operates in a virtual environment. If these redundant operations are removed, the Guest OS can be more efficient. 3.2.1 Removing the Redundant Operations We try to locate and remove redundant operations in a Guest OS. An example is verify_pmtmr_rate. This function causes the most VM Exits during bonnie++ running (see Section 2.3). It is used to adjust timer on the motherboard, which is useless in a virtual environment. We modify the Guest OS by simply removing this function. The result is illustrated in Figure 8. The write performance is improved across all block sizes. Block Write Throughput

11200 11100 11000

) s / 10900 b K 10800 ( t u p 10700 h g 10600 u o r 10500 h T 10400 10300 10200 256B 512B

before

1K

2K

after

4K

8K

16K

32K

64K 128K 256K 512K

Block size

Fig. 8. Write performance (before and after verify_pmtmr_rate removed)

Another redundant operation is I/O scheduling in the Guest OS, because all the I/O requests scheduled by the Guest OS will always rescheduled in the Host OS. The scheduler in the Guest OS is thus redundant. For one thing, the Guest OS is unaware of the physical disk information, therefore the results of this scheduler may not be helpful anyway. For another, when multiple Guest OSes operate simultaneously, the Host OS must reschedule I/O requests from all Guest OSs and try to maximize the whole system performance. We remove the I/O scheduler in the Guest OS, and I/O requests are directly submitted to drivers. After our optimization, the throughput of block writes on the virtual machine increases by 6.01%; while throughput of block reads increases by 5.64% (the Host OS uses CFQ I/O scheduler [12] by default). The disadvantage is that all the I/O requests will be submitted in the order of FIFO when the I/O scheduling is eliminated. We cannot arrange set up performance quotas without the I/O scheduler. 3.2.2 Optimizing NIC Driver in Guest OS A NAPI function is used to reduce interrupt frequency via interrupt masking. But in a virtual environment, a virtual NIC can’t generate interrupts directly. So the NAPI support in the NIC driver in the Guest OS is useless and time-consuming. We remove

Evaluating and Optimizing I/O Virtualization in KVM

229

the NAPI support from the NIC driver in the Guest OS, and evaluate the performance. The result is illustrated in Figures 9. Netperf 2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0 TCP Stream KVM(e1000+NAPI)

UDP Stream

TCP RR

UDP RR

KVM(e1000)

Fig. 9. Netperf performance (with/without NAPI support)

It can be observed that this optimization improves network throughput. In TCP /UDP Stream and UDP Request/Response test, after NAPI support is removed from the NIC driver, the throughput is increased by more than 60%.

4 Related Work This section discusses the related work that focuses on reducing context switching overhead and modifying Guest OS. 4.1 Reducing the Context Switching Overhead There are two ways to reduce context switching cost. One is to lower the frequency of such operations, and the other is to reduce the cost of each switching. Submission in batch is one way to reduce context switch frequency. And another way is to reduce device interrupt if possible. Sugerman et al. [1] have implemented batch submission based VMware Workstation. When the context switch frequency exceeds a certain threshold, the requests will be queued in a cache until the next interrupt happens, and then all queued requests can be handled at one time. This method can also reduce the IRQ transfer cost because only one IRQ is needed once the batch of requests is complete. Another optimization by Sugerman et al. [1] is to improve the Guest driver protocol. They design an interface suitable for a virtual environment. For example, this interface avoids the I/O instructions accessing device status. And it can reduce the number of virtual IRQs which cause context switching between a VMM and a VM. Virtio (R. Russell [4]) is a general device driver which provides the same operation process for block and network devices. Virtio uses para-virtualization for reference, and can be used by various VMMs. The implementation is to maintain buffer rings based on shared memory between a Guest OS and a VMM. One of them posts to the rings, while the other consumes them. And an event notification mechanism is

230

B. Zhang et al.

implemented. When buffers are added to the ring, the other side will be notified. This notification can be masked, to facilitate batching and reduce context switching. Reduce the cost of each switch is to simplify the operations during each switching. Sugerman et al. [1] modify the process switch operation in a Guest OS on VMware Workstation. When switching to the idle task, the page table is not reloaded, since the idle task is a kernel thread, and can use any process’s page table. This optimization cuts the virtualization overhead caused by MMU by a half. In addition, Sugerman et al. try to optimize VMware Workstation via Host OS bypassing. The optimization is to make the VMM access hardware device directly, and thus avoid context switching between the VMM and the Host OS. This method is employed by full virtualization systems, e.g., VMware ESX Server. Another cost is due to TLB misses and cache misses after context switches. Aravind Menon et al. [2] enhance the Guest OS in Xen and make it support advanced virtual memory management, e.g. superpage and global page mapping, which greatly reduces TLB misses caused by context switches. 4.2 Simplifying Guest OS Ram et al. [3] manage to reduce the overhead in a Guest OS in three aspects. Firstly, they implement LRO (Large Receive Offload) which combines a number of data packets into one large-sized TCP/IP packet, so that a large amount of data can be handled during a single protocol stack process. Secondly, they reduce the buffer size to half-page which can reduce the working set and thereby reduce the TLB miss rate. Menon et al. [2] improve Xen’s virtual network interface to support offload features which is supported by most NICs (if the hardware does not support offload, then the Driver domain can simulate it, which can also improve performance). Offload functions include scatter/gather I/O, TCP/IP checksum offload, TCP segmentation offload (TSO). Scatter/gather I/O supports continuous DMA operation with noncontiguous memory. TSO reduces the number of packets to be processed. Checksum offload reduces the Guest OS loads. Most other optimizations specifically focus on the architecture of Xen, which are not very helpful to other VMMs. A series of hardware assistance and specifications try to assist device virtualization from the hardware level, including Intel VT-d [9], AMD IOMMU [10], and PCI-SIG IOV [11]. VT-d and IOMMU are similar. They ensure the isolation of I/O address space between different VMs. An I/O MMU, similar to MMU, is installed on the PCI bridge to translate DMA addresses to machine memory addresses. And an IOTLB accelerates this translation. PCI-SIG IOV includes ATS (Address Translation Services), SR-IOV (Single Root IOV), and MR-IOV (Multi-Root IOV). A series of specifications help I/O address translation and let the devices provide multiple interfaces for multiple VMs direct access. These new technologies and specifications help a guest to access hardware devices directly, avoiding virtualization overhead and simplifying implementation of I/O virtualization. Future research on optimizing I/O virtualization must try to focus on how to optimize the Guest OS to make it more efficient in the virtual environment. Another direction is to examine how to make the Guest OS, the VMM, and the Host OS work better in coordination with each other.

Evaluating and Optimizing I/O Virtualization in KVM

231

5 Conclusion We evaluate KVM I/O performance and propose some methods to optimize it. We reduce VM Exits by merging successive I/O instructions and decreasing the frequency of timer interrupt. And we simplify Guest OS by removing redundant operations in the virtual environment. Guest OS simplification will be an important direction in future research to optimize VMM performance. We will continue to research on how to make a Guest OS more efficient in a virtual environment and how to make the Guest OS, the VMM, and the Host OS coordinate with each other better.

Acknowledgement This work is supported by the National Grand Fundamental Research 973 Program of China under Grant No. 2007CB310900, National Science Foundation of China under Grant No.90718028 and No. 60873052, National High Technology Research 863 Program of China under Grant No.2008AA01Z112, and MOE-Intel Information Technology Foundation under Grant No. MOE-INTEL-08-09. Zhenlin Wang is also supported by NSF Career CCF0643664.

References [1] Jeremy, S., Ganesh, V., Beng-Hong, L.: Virtualizing I/O Device on VMware Workstation’s Hosted Virtual Machine Monitor. In: Proceddings of the 2001 USENIX Annual Technical Conference (June 2001) [2] Menon, A., Cox, A.L., Zwaenepoel, W.: Optimizing network virtualization in Xen. In: Proceedings of the Annual Technical Conference on USENIX 2006 Annual Technical Conference (2006) [3] Ram, K.K., Santos, J.R., Turner, Y., Cox, A.L., Rixner, S.: Achieving 10 Gb/s using safe and transparent network interface virtualization. In: Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (2009) [4] Russell, R.: Virtio: Towards a De-Facto Standard For Virtual I/O Devices. Operating System Review 42(5), 95–103 (2008) [5] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., Warfield, A.: Xen and the Art of Virtualization. In: Proceedings of the 19th ACM SOSP (2003) [6] bonnie++, http://www.coker.com.au/bonnie++/ [7] netperf, http://www.netperf.org/netperf/ [8] SPEC JBB (2005), http://www.spec.org/jbb2005/ [9] Intel Corporation. Intel® Virtualization Technology for Directed I/O Architecture Specification (2007), http://download.intel.com/technology/computing/ vptech/Intelr_VT_for_Direct_IO.pdf [10] Advanced Micro Devices, Inc. ADM I/O Virtualization Technology (IOMMU) Specification (2007), http://www.amd.com/us-en/assets/content_type/ white_papers_and_tech_docs/34434.pdf [11] PCI-SIG. I/O Virtualization (2007), http://www.pcisig.com/specifications/iov/ [12] Axboe, J.: Time Sliced CFQ I/O Scheduler, http://kerneltrap.org/node/4406

Distributed Stream Processing with DUP Kai Christian Bader1 , Tilo Eißler1 , Nathan Evans1 , Chris GauthierDickey2 , Christian Grothoﬀ1 , Krista Grothoﬀ1 , Jeﬀ Keene2 , Harald Meier1 , Craig Ritzdorf2 , and Matthew J. Rutherford2 1 Faculty of Informatics Technische Universit¨ at M¨ unchen 2 Department of Computer Science University of Denver

Abstract. This paper introduces the DUP System, a simple framework for parallel stream processing. The DUP System enables developers to compose applications from stages written in almost any programming language and to run distributed streaming applications across all POSIXcompatible platforms. Parallel applications written with the DUP System do not suﬀer from many of the problems that exist in traditional parallel languages. The DUP System includes a range of simple stages that serve as general-purpose building blocks for larger applications. This work describes the DUP Assembly language, the DUP architecture and some of the stages included in the DUP run-time library. We then present our experiences with parallelizing and distributing the ARB project, a package of tools for RNA/DNA sequence database handling and analysis. Keywords: Coordination language, parallel programming, productivity.

1

Introduction

The widespread adoption of multi-core processors and the commoditization of specialized co-processors like GPUs [1] and SPUs [2] requires the development of tools and techniques that enable non-specialists to create sophisticated programs that leverage the hardware at their disposal. Mainstream and productive development cannot rely on teams of domain and hardware experts using specialized languages and hand-optimized code, though this style of development will remain applicable to high-performance computing (HPC) applications that demand ultimate performance. This paper introduces the DUP System1 , a language system which facilitates productive parallel programming for stream processing on POSIX platforms. It is not the goal of the DUP System to provide ultimate performance; we are instead willing to sacriﬁce some performance gain for signiﬁcant beneﬁts in terms of programmer productivity. By providing useful and intuitive abstractions, the DUP System enables programmers without experience in parallel programming 1

Available at http://dupsystem.org/

C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 232–246, 2010. c IFIP International Federation for Information Processing 2010

Distributed Stream Processing with DUP

233

or networking to develop correct parallel and distributed applications and obtain speed-ups from parallelization. The key idea behind the DUP System is the multi-stream pipeline programming paradigm and the separation of multi-stream pipeline speciﬁcation and execution from the language(s) used for the main computation. Multi-stream pipelines are a generalization of UNIX pipelines. However, unlike UNIX pipelines, which are composed of processes which read from at most one input stream and write to a single output stream (and possibly an error stream), multi-stream pipelines are composed of processes that can read from any number of input streams and write to any number of output streams. In the remainder of this document, we will use the term “stage” for individual processes in a multi-stream pipeline. Note that UNIX users — even those with only rudimentary programming experience — can usually write correct UNIX pipelines which are actually parallel programs. By generalizing UNIX pipelines to multi-stream pipelines, we eliminate the main restriction of the UNIX pipeline paradigm — namely, the inherently linear data ﬂow. In order to support the developer in the use of multi-stream pipelines, the DUP System includes a simple coordination language which, similar to syntactic constructs in the UNIX shell, allows the user to specify how various stages should be connected with streams. The DUP runtime then sets up the streams and starts the various stages. Key beneﬁts of the DUP System include: 1. Stages in a multi-stream pipeline can run in parallel and on diﬀerent cores; 2. Stages can be implemented, compiled and tested individually using an appropriate language and compiler for the given problem and architecture; 3. Stages only communicate using streams; streams are a great match for networking applications and for modern processors doing sequential work; 4. If communication between stages is limited to streams, there is no possibility of data races and other issues that plague developers of parallel systems; 5. While the DUP System supports arbitrary data-ﬂow graphs, the possibility of deadlocks can be eliminated by only using acyclic data-ﬂow graphs; 6. Applications built using multi-stream pipelines can themselves be composed into a larger multi-stream pipeline, making it easy for programmers to express hierarchical parallelism In addition to introducing the DUP System itself, this paper also presents experimental results from a case study involving the DUP System. The case study shows that it is possible to rapidly parallelize and distribute an existing complex legacy bioinformatics application and obtain signiﬁcant speed-ups using DUP.

2

Approach

The fundamental goal of multi-stream pipelines is to allow processes to read from multiple input streams and write to multiple output streams, all of which may be connected to produce the desired data-ﬂow graph. This generalization of linear

234

K.C. Bader et al.

UNIX pipelines can be implemented using traditional UNIX APIs,2 especially the dup2 system call. Where a typical UNIX shell command invocation only connects stdin, stdout and stderr, the DUP System establishes additional I/O streams before starting a stage. Using this method, traditional UNIX ﬁlters (such as grep) can be used as stages in the DUP System without modiﬁcation. New stages can be implemented in any language environment that supports POSIXlike input-output operations (speciﬁcally, reading and writing to a ﬁle). Since dup2 also works with TCP sockets, the DUP System furthermore generalizes multi-stream pipelines to distributed multi-stream pipelines. 2.1

The DUP Assembly Language

The DUP Assembly language allows developers to specify precisely how to connect stages and where those stages should be run. Figure 1 lists the DUP Assembly code for a distributed “Hello World” example program. s @10.0.0.1[0out.txt] $

fanout; grep Hello; grep World; faninany;

Fig. 1. DUP speciﬁcation. in.txt is passed to fanout (“0
In essence, the DUP language allows developers to specify a directed graph using an adjacency list representation and IO redirection syntax similar to that of well-known UNIX shells [3]. The nodes in the directed graph are the stages initiated by DUP. A DUP program consists of a list of statements, each of which corresponds to one such node. Statements start with a label that is used to reference the respective stage in the speciﬁcation of other stages. The keyword DUP is used to reference streams associated with the controlling dup command in the case that the dup command itself is used as a stage. The label is followed by a hostname specifying on which system the stage will be run. A helper process, dupd, will be started on the speciﬁed host, listen on a port to establish network connections and eventually supervise stages run there. The address is followed by a comma-separated list of edges representing primarily the outgoing streams for this stage. Input streams are only explicitly speciﬁed in the case of input from ﬁles or the controlling dup command. Inputs from other stages are not speciﬁed because they can be inferred from the respective entry of the producing stage. DUP supports four diﬀerent ways to create streams for a stage: 2

The APIs needed are supported by all platforms conforming to the POSIX standard, including BSD, GNU/Linux, OS X, and z/OS.

Distributed Stream Processing with DUP <STAGE> <EDGELIST> <EDGE>

::= ::= ::= ::= ::= ::= ::=

235

<STAGE>* ’@’

’[’ <EDGELIST> ’]’ ’$’ ’;’ <EDGE> (’,’ <EDGE>)* | ’:’ ’|’ | ’<’ | ’>’ | ’>>’

Fig. 2. Simpliﬁed grammar for the DUP Assembly language. Note that we do not expect programmers to need to develop applications by directly using this language in the future; this language is the “assembly” language supported by the DUP runtime system. Higher-level languages running on top of DUP that facilitate (static) process scheduling and aspect oriented programming are under development.

Read: An input ﬁle edge consists of an integer, the “<” operator and a path to the ﬁle to be used as input. The integer is the ﬁle descriptor from which the stage will read the input stream. dupd is responsible for opening the input stream and validating that the ﬁle exists and is readable. Write: An output ﬁle edge for writing consists of an integer, the “>” operator and a path to the ﬁle to be overwritten or created. The integer is the ﬁle descriptor to which this stage will write. dupd checks that the speciﬁed path can be used for writing. Append: An output ﬁle edge for appending consists of an integer, the “>>” operator and a path to the ﬁle. The integer is the ﬁle descriptor to which this stage will write. Pipe: Non-ﬁle output edges consist of an integer, the “|” operator, a stage label, the “:” character and another integer. The ﬁrst integer speciﬁes the ﬁle descriptor to which this stage will write. The label speciﬁes the process on the other end of the pipe or TCP stream and the second integer is the ﬁle descriptor from which the other stage will read. If an edge list contains a label that is not deﬁned elsewhere in the conﬁguration ﬁle then the program ﬁle is considered malformed and rejected by dup. The ﬁnal component of a complete stage statement is the command (with arguments) that is used to start the process. Figure 2 contains a formal grammar for the DUP language. The grammar omits I/O redirection from/to the controlling dup command for clarity. 2.2

DUP System Architecture

The dup client interprets the mini-language from Section 2.1 which speciﬁes how the various stages for the application should be connected. dup then connects to hosts running ssh servers and starts dupd helper processes which then receive control information via the SSH tunnel. The control information speciﬁes the binary names and arguments for the stages as well as how to establish TCP streams and UNIX pipes to connect the stages with each other.

236

K.C. Bader et al.

Figure 3 illustrates how the components of the system work together.

fanout /deal exec

pe

TC P

faninany /gather out.txt

TC P

ex ec

in.txt

pi

grep /ARB

grep/ ARB

pi

pe exec

exec

dupd ssh

dupd ssh

DUP code

Fig. 3. Overview for one possible conﬁguration of the DUP System. Red (dashed) lines show application data ﬂow. Black (solid) lines correspond to actions by DUP. Examples for DUP Assembly corresponding to the illustration are given in Figures 1, 4 and 5 respectively: the three code snippets specify the same data-ﬂow graph, but with diﬀerent commands.

The primary interaction between dup and the dupds involves four key steps [4]: 1. dup starts the dupds and transmits session information. This includes all of the information related to processes that are supposed to be run on the respective dupd. 2. When a stage is conﬁgured to transmit messages to a stage initiated by another dupd, the dupd responsible for the data-producing stage establishes a TCP connection to the other dupd and transmits a header specifying which stage and ﬁle descriptor it will connect to the stream. If dup is used as a ﬁlter, it too opens similar additional TCP streams with the respective dupds. The main diﬀerence here is that dup also initiates TCP connections for streams where dup will ultimately end up receiving data from a stage. 3. Once a dupd has conﬁrmed that all required TCP streams have been established, that all required ﬁles could be opened, and that the binaries for the stages exist and are executable, it transmits a “ready” message to the controlling dup process (using the connection on which the session information was initially received). 4. Once all dupds are ready, dup sends a “go” message to all dupds. The dupds then start the processes for the session.

Distributed Stream Processing with DUP

2.3

237

Generic DUP Stages

Taking inspiration from stages available in CMS [5,6], the DUP System includes a set of fundamental multi-stream stages. UNIX already provides a large number of ﬁlters that can be used to quickly write non-trivial applications with a linear pipeline. Examples of traditional UNIX ﬁlters include grep [7], awk [8], sed [8], tr, cat, wc, gzip, tee, head, tail, uniq, buffer and many more [3]. While these standard tools can all be used in the DUP System, none of them support multiple input or output streams. In order to facilitate the development of multi-stream applications with DUP, we provide a set of primitive stages for processing multiple streams. Some of the stages currently included with the DUP System are summarized in Table 1. Many of the stages listed in Table 1 are inspired by the CMS multi-stream pipeline implementation [6]. Naturally, we expect application developers to write additional application-speciﬁc stages. Table 1. Summary of general-purpose multi-stream stages to be used with DUP in addition to traditional UNIX ﬁlters. Most of the ﬁlters above can either operate lineby-line in the style of UNIX ﬁlters or using a user-speciﬁed record length. I/O Streams in out fanout Replicate input n times 1 n faninany Merge inputs, any order n 1 gather Merge inputs, round-robin (waits for input) n 1 holmerge Forward input from stream that has sent the most data so n 1 far, discard data from other streams until they catch up deal Split input round robin to output(s), or per control stream 2 n mgrep Like grep, except non-matching lines output to secondary 1 2 stream lookup Read keys from stream 3; tokens to match keys from stream 2 3 0; write matched tokens to 1, unmatched to 4 and unmatched keys to 5 gate forward 1st input to 1st output until 2nd input ready 2 1 Stage

2.4

Description

DUP Programming Philosophy

In order to avoid the common data consistency issues often found in parallel programming systems, stages and ﬁlters for DUP should not perform any updates to shared storage outside of the memory of the individual process. While the DUP System has no way to enforce this property, updates to ﬁles or databases could easily cause problems; if stages were allowed to update storage, changes in the order of execution could easily result in unexpected non-determinism. This might be particularly problematic when network latency and stage scheduling cause non-deterministic runs in a larger system that replicates parts of the computation (e.g., in order to improve fault-tolerance).

238

K.C. Bader et al.

For applications that require parallel access to shared mutable state, the DUP System can still be used to parallelize (and possibly distribute) those parts that lend themselves naturally to stream processing. Other parts of the code should then be designed to communicate with the DUP parts of the application through streams. We speciﬁcally expect stages developed for the DUP System to be written in many diﬀerent languages. This will be necessary so that the application can take advantage of the specialized resources available in heterogeneous multi-core or HPC systems. Existing models for application development on these systems often force the programmer to use a particular language (or small set of languages) for the entire application. For example, in a recent study of optimization techniques for CUDA code [9], twelve benchmark programs were modiﬁed by porting critical sections to the CUDA model. On average, these programs were only 14% CUDA-speciﬁc, yet the presence of CUDA sections limits the choice of languages and compilers for the entire program. The implications are clear: the use of a monolithic software architecture for programs designed to operate eﬃciently on high-performance hardware will severely restrict choices of development teams and possibly prevent them from selecting the most appropriate programming language and tool-chain for each part of a computation. Using the DUP System, developers will be able to compose larger applications from stages written in the most appropriate language available. Another important use-case for DUP is the parallel and distributed execution of legacy code. In contrast to other new languages for parallel programming, which all too often advocate for large-scale (and often manual) program translation eﬀorts, the DUP philosophy calls for writing thin wrappers around legacy code to obtain a streaming API. As we experienced in our case study, it is typically easy to adapt legacy applications to consume inputs from streams and to produce outputs as streams.

3

Case Study: Distributed Molecular Sequence String Matching

Exact and inexact string searching in gene sequence databases plays a central role in molecular biology and bioinformatics. Many applications require string searches, such as searching for gene sequence relatives and mining for PCRprimers or DNA-probes in DNA sequences [10,11,12]; both of these applications are important in the process of developing molecular diagnostic assays for pathogenic bacteria or viruses based upon speciﬁc DNA ampliﬁcation and detection. In the ARB software package, a suﬃx-tree-based search index, called the PTServer, is the central data structure used by applications for fast sequence string matching [13]. A PT-Server instance is built once from the sequence entries of a gene sequence database of interest and is stored permanently on disk. In order to perform eﬃcient searches, the PT-Server is loaded into main memory in its entirety — if the entire data structure cannot ﬁt into the available main

Distributed Stream Processing with DUP

239

memory (the PT-Server requires ∼ 36 bytes of memory per sequence base), the database cannot be eﬃciently searched. In addition to memory consumption, the runtime performance of the search can be quite computationally intensive. An individual exact string search — in practice, short sequence strings of length 15–25 base pairs are searched for — is quick (3–15 milliseconds). However, the execution time can become signiﬁcant when millions of approximate searches are performed during certain bioinformatic analyses, such as probe design. In the near future, the number of published DNA sequences will explode due to the availability of new high-throughput sequencing technology [14]. As a result, current sequential analysis methods will be unable to process the available data within reasonable amounts of time. Furthermore, rewriting more than half-amillion lines of legacy C and C++ code of the high-performance ARB software package is prohibitively expensive. The goal of this case study was to see how readily the existing ARB PT-Server could be distributed and parallelized using the DUP System. Speciﬁcally, we were interested in parallelization in order to reduce execution time and in distribution in order to reduce per-system memory consumption. 3.1

Material and Methods

The study used 16 compute nodes of the Inﬁniband Cluster in the Faculty of Informatics at the Technische Universit¨at M¨ unchen [15]. Each node was equipped with an AMD Opteron 850 2.4 GHz processor with 8 GB of memory, and the nodes were connected using a 4x Inﬁniband network. The SILVA database (SSURef 91 SILVA 18 07 07 opt.arb) [16], which stores sequences of small subunit ribosomal ribonucleic acids and consists of 196,890 sequence entries (with 289,563,473 bases), was used for preparing test database sets and respective PTServers. We divided the original database into 1, 2, 4, 8, and 16 partitions, and a random sampling algorithm was used for composing the partitioned database sets (within each database analysis set, each partition is about the same size). The PT-Servers used in this study were created from these partitions. Table 2 characterizes the resulting partitions and PT-Servers. For the queries, we selected 800 inverse sequence strings of rRNA-targeted oligonucleotide probe sequences of length 15–20 from probeBase, a database of published probe sequences [17]. Each retrieved sequence string has matches in the SILVA database and the respective PT-Server instance. Applying these real world query sequence strings ensured that every search request required non-trivial computation and communication. We generated four sets of inverse sequence strings (400 strings each) by random string distribution of the original dataset from probeBase, and every test run was performed with these four datasets. The presented performance values are the means of the four individually recorded runs.

240

K.C. Bader et al.

Table 2. Resulting problem sizes for the diﬀerent numbers of partitions. This table lists the average number of sequences and bases for the PT-Server within each partition and the resulting memory consumption for each PT-Server as well as the total memory consumption for all partitions. # Part. # Sequences # MBases Memory Part. 1 196,890 289.6 1,430 2 98,445 144.7 745 49,222 72.4 402 4 24,611 36.2 231 8 16 12,305 18.1 145

3.2

(MB) total 1,430 1,489 1,609 1,849 2,327

Adapting ARB for DUP

In the ARB software package, arb probe is a program which performs, per execution, one string search using the PT-Server when a search string and accompanying search parameters are speciﬁed (these are passed as command line arguments). For DUP, arb probe had to be modiﬁed to read the parameters and the search string as a single line from stdin and pass one result set per line to stdout. It took one developer (who had experience with ARB but not DUP or distributed systems) about three hours to create the modiﬁed version arb probe dup and another two hours to compile DUP on the Inﬁniband Cluster, write adequate DUP scripts and perform the ﬁrst run-time test. Debugging, testing, optimization and gathering of benchmark results for the entire case study was done in less than two weeks. All searches were conducted using the program arb probe dup with similar parameters: id 1 mcmpl 1 mmis 3 mseq ACGTACGT. The ﬁrst parameter (id 1) set the PT-Server ID; the second activated the reverse complement sequence (mcmpl 1). For each dataset and approach, the third parameter was used to perform an exact search (mmis 0) in order to ﬁnd matches identical with the search string and an approximate search (mmis 3) in order to ﬁnd all identical strings and all similar ones with maximum distance of three characters to the search string. The last parameter indicated the match sequence. Figure 4 shows the DUP assembly code for the replicated run with two servers. Here, identical PT-Servers are used with the goal of optimizing execution time. Figure 5 shows the equivalent DUP assembly code for the partitioned setting. In this case, since each PT-Server only contains a subset of the overall database, all requests are broadcast to all PT-Servers using fanout. 3.3

Results and Discussion

As shown in Table 2, partitioning the original database into n partitions results in almost proportional reductions in per-node memory consumption: doubling the number of partitions means almost halving the memory consumption per PT-Server partition. In practice we expect signiﬁcantly larger databases to be

Distributed Stream Processing with DUP s @opt1[0out.txt]

$ $ $ $

241

deal; arb_probe_dup; arb_probe_dup; faninany;

Fig. 4. DUP speciﬁcation for the replicated conﬁguration that uses identical ARB servers. The queries are simply distributed round-robin over the two available ARB PT-Servers and the results collected as they arrive.

s @opt1[0out.txt]

$ $ $ $

fanout; arb_probe_dup; arb_probe_dup; gather;

Fig. 5. DUP speciﬁcation for the partitioned conﬁguration where each ARB server only contains a slice of the database. The queries are broadcast to the available ARB PT-Servers and the results collected in round-robin order (to ensure that results for the same query arrive in one batch).

partitioned, resulting in partition sizes close to the size of the main memory of the HPC node responsible for the partition. Figure 6 summarizes the speedup we obtained using n PT-Server replicas (each processing a fraction of the queries). This answers the question as to how much performance could be gained by distributing the queries over n identical (replicated) PT-servers, each containing the full database. Compared with a local version (direct communication between a PT-Server and arb probe dup) we have measured a speedup of 5.84 for 16 compute nodes. replicated,exact replicated,approximate

Speed-up

6

4

2

0 0

2

4

6

8 10 # Nodes

12

14

16

Fig. 6. Speedup of sequence string matches for the replicated PT-Server. The plot shows the average speed-up over ﬁve runs for an exact search and an approximate search (with up to three mismatches).

242

K.C. Bader et al.

The available bandwidth of the compute cluster using TCP is about 107 MB/s, which is close to the 85 MB/s consumed on average by the collector node for 16 compute nodes. For this run, the average CPU utilization of the 16 compute nodes is about 34% and the master node uses about 56%. The legacy ARB code produces rather verbose output, which explains why this benchmark is IO-bound on our cluster when run with more than approximately 8 compute nodes. Converting the human-readable output to a compact binary format would likely result in a signiﬁcant performance improvement; however, the purpose of this study was to evaluate possible speed-ups for legacy code without signiﬁcant changes to the existing infrastructure and changing the message format of the ARB framework would be a signiﬁcant change. The overall runtime for querying a partitioned PT-Server with one sequence string set (400 requests) was in a range of 2 seconds (16 partitions) to 8.25 seconds (one partition) for exact searches, and 16 seconds (16 partitions) to 73 seconds (one partition) for approximate searches. For the replicated PT-Servers, execution time for exact searches ranged from approximately 8.3 seconds on one node to 1.5 seconds on 16 nodes. The approximate search (up to three mismatches) ranged from 72 seconds on one node to 13 seconds on 16 nodes. In an additional test run with the replicated servers, we increased the number of requests (to 2000) by repeating the string set to increase the measured time and reduce possible side eﬀects. The execution time ranged from 140.91 seconds (one node) to 27.09 seconds (16 nodes) for exact searches, and 1479.60 seconds (one node) and 222.26 (16 nodes) for the approximate search. 3.4

Conclusion and Future Work

The speed-ups achieved in this case study are by themselves clearly not sensational; however, the ratio of speedup to development time is. Programmer productivity is key here, especially since researchers in bioinformatics are rarely also experts in distributed systems. Furthermore, the improvements in performance and memory consumption are signiﬁcant and have direct practical value for molecular biologists and bioinformaticians, especially since, aside from the acceleration of sequence string searches by a factor 3.5 to 5.8, this approach also oﬀers biologists the possibility to search very large databases using the ARB PTServer without having to access special architectures with extreme extensions to main memory. In the future, we plan to use DUP to drive large-scale bioinformatics analyses. Depending on the problem size, we also expect to use DUP to combine partitioning and replication in one system. For example, it would be easy to create n replicas of m partitions in order to improve throughput while also reducing the memory consumption of the PT-Servers. Finally, assuming additional performance is desired, the ARB data format could be changed to be less verbose and thereby avoid bandwidth limitations.

Distributed Stream Processing with DUP

4

243

Related Work

The closest work to the DUP System presented in this paper are multi-stream pipelines in CMS [6]. CMS multi-stream pipelines provide a simple mini-language for the speciﬁcation of virtually arbitrary data-ﬂow graphs connecting stages from a large set of pre-deﬁned tools or arbitrary user-supplied applications. The main diﬀerence between CMS and the DUP System (which uses parallel execution of stages) is that CMS pipelines are exclusively record-oriented and implemented through co-routines using deterministic and non-preemptive scheduling with zero-copy data transfer between stages. CMS pipelines were designed for efﬁcient execution in a memory-constrained, single-tasking operating system with record-oriented ﬁles. In contrast, DUP is designed for modern applications that might not use record-oriented I/O and need to run in parallel and on many diﬀerent platforms. Another close relative to the DUP System are Kahn Process Networks (KPNs) [18]. A major diﬀerence between DUP and KPNs is that buﬀers between stages in DUP are bounded, which is necessary given that unbounded buﬀers cannot really be implemented and that in general determining a bound on the necessary size of buﬀers (called channels in KPN terminology) is undecidable [19]. Note that the UNIX command buffer can be used to create buﬀers of arbitrary size between stages in DUP. Another major diﬀerence with KPNs is that DUP does not require individual processes to be deterministic. Non-determinism on the process level voids some of the theoretical guarantees of KPNs; however, it also enables programmers to be much more ﬂexible in their implementations. While DUP allows non-determinism, DUP programmers explicitly choose nondeterministic stages in speciﬁc places; as a result, non-determinism in DUP is less pervasive and easier to reason about compared to languages oﬀering parallel execution with shared memory. Where CMS pipelines focus on the ability to glue small, reusable programs into larger applications, the programming language community has extended various general-purpose languages and language systems with support for pipelines. Existing proposals for stream-processing languages have focused either on highlyeﬃcient implementation (for example, for the data exchange between stages [20]) or on enhancing the abstractions given to programmers to specify the pipeline and other means of communication between stages [21]. The main drawback of all of these designs is that they force programmers to learn a complex programming language and rewrite existing code to ﬁt the requirements of the particular language system. The need to follow a particular paradigm is particularly strong for real-time and reactive systems [22,23]. Furthermore, especially when targeting heterogeneous multi-core systems, quality implementations of the particular language must be provided for each architecture. In contrast, the DUP language implementation is highly portable (relying exclusively on canonical POSIX system calls) and allows developers to implement stages in any language. On the systems side, related research has focused on maximizing performance of streaming applications. For example, StreamFlex [22] eliminates copying between ﬁlters and minimizes memory management overheads using types. Other

244

K.C. Bader et al.

research has focused on how ﬁlters should be mapped to cores [24] or how to manage data queues between cores [20]. While the communication overheads of DUP applications can likely be improved, this could not be achieved without compromising on some of the major productivity features of the DUP System (such as language neutrality and platform independence). In terms of language design and runtime, the closest language to the DUP Assembly language is Spade [25] which is used to write programs for InfoSphere Streams, IBM’s distributed stream processing system [26]. The main diﬀerences between Spade and the DUP Assembly language is that Spade requires developers to specify the format of the data stream using types and has built-in computational operators. Spade also restricts developers of ﬁlters to C++; this is largely because the InfoSphere runtime supports migrating of stages between systems for load-balancing and can also fuse multiple stages for execution in a single address space for performance. Dryad [27] is another distributed stream processing system similar to Spade in that it also restricts developers to developing ﬁlters in C++. Dryad’s scheduler and fault-tolerance provisions further require all ﬁlters to be deterministic and graphs to be free of cycles, making it impossible to write stages such as faninany or holmerge in Dryad. In comparison to both Spade and Dryad, the DUP System provides a simpler language with a much more lightweight and portable runtime system. DUP also does not require the programmer to specify a speciﬁc stream format, which enables the development of much more generic stages. Speciﬁcally, the Spade type system cannot be used to properly type stream-format agnostic ﬁlters such as cat or fanout. Finally, DUP is publicly available whereas both Spade and Dryad are proprietary. DUP is a coordination language [28] following in the footsteps of Linda [29]: the DUP System is used to coordinate computational blocks described in other languages. The main diﬀerence between DUP and Linda is that in DUP, the developer speciﬁes the data ﬂow between the components explicitly, whereas in Linda, the Linda implementation needs to match tuples published in the tuplespace against tuples published by other components. The matching of tuples in the Linda system enables Linda to execute in a highly dynamic environment where processes joining and leaving the system are easily managed. However, the matching and distribution of tuples also causes signiﬁcant performance issues for tuplespace implementations [30]. As a result, Linda implementations are not suitable for distributed stream processing with signiﬁcant amounts of data.

5

Conclusion

The signiﬁcant challenges with writing eﬃcient parallel high-performance code are numerous and well-documented. The DUP System presented in this paper addresses some of these issues using multi-stream pipelines as a powerful and ﬂexible abstraction around which an overall computation can be broken into independent stages, each developed in the language best suited for the stage, and each compiled or executed by the most eﬀective tools available. Our experience so

Distributed Stream Processing with DUP

245

far makes us conﬁdent that DUP can be used to quickly implement parallel programs, to obtain signiﬁcant performance gains, and to experiment with various dataﬂow graph conﬁgurations with diﬀerent load-distribution characteristics.

Acknowledgements The authors thank Prof. Dr. Matthias Horn, University of Vienna, for providing us with probe sequences from probeBase for the bioinformatics case study. This work was funded in part by the Deutsche Forschungsgemeinschaft (DFG) under ENP GR 3688/1-1 and by the Bayerische Forschungsstiftung (BFS) under AZ 767-07 (the NANOBAK Projekt).

References 1. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA Tesla: A uniﬁed graphics and computing architecture. IEEE Micro 28, 39–55 (2008) 2. Flachs, B., Asano, S., Dhong, S.H., Hofstee, P., Gervias, G., Kim, R., Le, T., Liu, P., Leenstra, J., Liberty, J., Michael, B., Oh, H., Mueller, S.M., Takahashi, O., Hatakeyama, A., Wantanbe, Y., Yano, N.: A stream processing unit for a cell processor. In: IEEE International Solid-State Circuits Conference, pp. 134–135 (2005) 3. Quigley, E.: UNIX Shells, 4th edn. Prentice Hall, Englewood Cliﬀs (2004) 4. Grothoﬀ, C., Keene, J.: The DUP protocol speciﬁcation v2.0. Technical report, The DUP Project (2010) 5. Hartmann, J.P.: CMS Pipelines Explained. IBM Denmark (2007), http://vm.marist.edu/~ pipeline/ 6. IBM: CMS Pipelines User’s Guide. version 5 release 2 edn. IBM Corp. (2005), http://publibz.boulder.ibm.com/epubs/pdf/hcsh1b10.pdf 7. Goebelbecker, E.: Using grep: Moving from DOS? Discover the power of this Linux utility. Linux Journal (1995) 8. Dougherty, D.: Sed and AWK. Reilly & Associates, Inc., Sebastopol (1991) 9. Ryoo, S., Rodrigues, C.I., Baghsorkhi, S.S., Stone, S.S., Kirk, D.B., Mei, W., Hwu, W.W.: Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In: PPoPP 2008: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 73–82. ACM, New York (2008) 10. Nordberg, E.K.: YODA: selecting signature oligonucleotides. Bioinformatics 21, 1365–1370 (2005) 11. Linhart, C., Shamir, R.: The degenerate primer design problem. Bioinformatics 18(Suppl. 1), S172–S181 (2002) 12. Kaderali, L., Schliep, A.: Selecting signature oligonucleotides to identify organisms using DNA arrays. Bioinformatics 18, 1340–1349 (2002) 13. Ludwig, W., Strunk, O., Westram, R., Richter, L., Meier, H., Yadhukumar, Buchner, A., Lai, T., Steppi, S., Jobb, G., F¨ orster, W., Brettske, I., Gerber, S., Ginhart, A.W., Gross, O., Grumann, S., Hermann, S., Jost, R., K¨ onig, A., Liss, T., L¨ ussmann, R., May, M., Nonhoﬀ, B., Reichel, B., Strehlow, R., Stamatakis, A., Stuckmann, N., Vilbig, A., Lenke, M., Ludwig, T., Bode, A., Schleifer, K.H.: ARB: a software environment for sequence data. Nucleic Acids Research 32, 1363–1371 (2004)

246

K.C. Bader et al.

14. Shendure, J., Ji, H.: Next-generation DNA sequencing. Nat. Biotechnol. 26, 1135– 1145 (2008) 15. Klug, T.: Hardware of the InﬁniBand Cluster 16. Pruesse, E., Quast, C., Knittel, K., Fuchs, B.M., Ludwig, W., Peplies, J., Gl¨ockner, F.O.: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Research 35, 7188–7196 (2007) 17. Loy, A., Maixner, F., Wagner, M., Horn, M.: probeBase – an online resource for rRNA-targeted oligonucleotide probes: new features 2007. Nucleic Acids Research 35 (2007) 18. Kahn, G.: The semantics of a simple language for parallel programming. Information Processing, 993–998 (1974) 19. Parks, T.M.: Bounded Scheduling of Process Networks. PhD thesis, University of California, Berkeley (1995) 20. Giacomoni, J., Moseley, T., Vachharajani, M.: Fastforward for eﬃcient pipeline parallelism: a cache-optimized concurrent lock-free queue. In: PPoPP 2008: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 43–52. ACM, New York (2008) 21. Thies, W., Karczmarek, M., Amarasinghe, S.P.: Streamit: A language for streaming applications. In: Horspool, R.N. (ed.) CC 2002. LNCS, vol. 2304, pp. 179–196. Springer, Heidelberg (2002) 22. Spring, J.H., Privat, J., Guerraoui, R., Vitek, J.: Streamﬂex: high-throughput stream programming in java. SIGPLAN Not. 42, 211–228 (2007) 23. Lee, E.A.: Ptolemy project (2008), http://ptolemy.eecs.berkeley.edu/ 24. Kudlur, M., Mahlke, S.: Orchestrating the execution of stream programs on multicore platforms. In: PLDI 2008: Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 114–124. ACM, New York (2008) 25. Hirzel, M., Andrade, H., Gedik, B., Kumar, V., Losa, G., Soule, R., Wu, K.-L.: Spade language speciﬁcation. Technical report, IBM Research (2009) 26. Amini, L., Andrade, H., Bhagwan, R., Eskesen, F., King, R., Selo, P., Park, Y., Venkatramani, C.: Spc: A distributed, scalable platform for data mining. In: Workshop on Data Mining Standards, Services and Platforms, DM-SPP (2006) 27. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: Distributed dataparallel programs from sequential building blocks. In: European Conference on Computer Systems (EuroSys), Lisabon, Portugal, pp. 59–72 (2007) 28. Gelernter, D., Carriero, N.: Coordination languages and their signiﬁcance. ACM Commun. 35, 97–107 (1992) 29. Carriero, N., Gelernter, D.: Linda in context. ACM Commun. 32, 444–458 (1989) 30. Wells, G.C.: A Programmable Matching Engine for Application Development in Linda. PhD thesis, University of Bristol (2001)

CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries Yongqiang Zou, Jia Liu, Shicai Wang, Li Zha, and Zhiwei Xu Institute of Computing Technology, Chinese Academy of Sciences Beijing, 100190, China {zouyongqiang,liujia09,wangshicai}@software.ict.ac.cn, {char,zxu}@ict.ac.cn

Abstract. Massive scale distributed database like Google’s BigTable and Yahoo!’s PNUTS can be modeled as Distributed Ordered Table, or DOT, which partitions data regions and supports range queries on key. Multi-dimensional range queries on DOTs are fundamental requirements; however, none of existing schemes work well while considering three critical issues: high performance, low space overhead, and high reliability. This paper introduces CCIndex scheme, short for Complemental Clustering Index, to solve all three issues. CCIndex creates several Complemental Clustering Index Tables for performance, leverages region-to-server information to estimate result size, and supports incremental data recovery. This paper builds a prototype on Apache HBase. Theoretical analysis and micro-benchmarks show that CCIndex consumes 5.3% ~ 29.3% more space, has the same reliability, and gains 11.4 times range queries throughput of secondary index scheme. Synthetic application benchmark shows that CCIndex query throughput is 1.9 ~ 2.1 times of MySQL Cluster. Keywords: Clustering, index, range queries, multi-dimensional.

1 Introduction Massive scale distributed databases like Google’s BigTable [1] and Yahoo!’s PNUTS [2] gain more and more attention to store data for Internet scale applications. These systems can be modeled as Distributed Ordered Table, short as DOT, which partitions continuous keys to regions, replicates regions for performance and reliability, distributes regions to shared-nothing region servers for scalability, serves as tables and columns, and supports range queries on keys. Multi-dimensional range queries on DOT systems are natural requirements. For example, a query needs to find out nearby restaurants through “latitude > 48.5 and latitude < 48.6 and longitude > 112.5 and longitude < 112.8 and type = restaurants”. Another example is finding out hottest pictures in this week in a photo-sharing application, such as Flickr, with a query like “timestamp > 1267660008 and rank > 1000”. With only DOT’s range queries over key, a multi-dimensional range query is a table scan over key with predicates on non-key C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 247–261, 2010. © IFIP International Federation for Information Processing 2010

248

Y. Zou et al.

columns to filter results, which is very ineffective for a low selectivity query on nonkey columns, and the latency is unacceptable in large scale data sets, such as TB or PB level. This paper refers to selectivity by the percentage of records passing the predicate. The inefficiency requires index on non-key columns to accelerate multidimensional range queries. However, multi-dimensional range queries over DOT are big challenges if we considering the three critical issues: high performance, low space overhead, and high reliability. None of existing schemes can work well considering all the three issues. Building secondary indexes for non-key columns through creating ordered tables to store indexes is common. However, the range query over secondary index is significantly slow, because random read is slower than scan (eg. In BigTable is 13.7 times). Other “better” index schemes without clustering data will encounter the same problem as slow random read. Clustering index reduces the random reads but needs several times storage, and data recovery is a big issue if the underlying replica mechanism is disabled. Lacking of statistics on DOTs imposes more difficulties on optimizing multidimensional range queries. This paper introduces a new scheme CCIndex, short for Complemental Clustering Index, to support multi-dimensional range queries over DOT while achieving high performance, low space overhead, and high reliability. CCIndex creates several Complemental Clustering Index Tables, or CCIT, each for a search column with the full row data, which makes range query over this column a range scan. CCIndex leverages the region-to-server mapping information to estimate the result size of each queries. CCIndex disables the underlying data replica mechanisms to avoid too much storage overhead, and creates a replicated Complemental Check Table, or CCT, for each search column to support incremental data recovery. CCIndex prototype has been built on Apache HBase, a subproject of Hadoop. Theoretical analysis and experimental evaluations have been given. The rest of this paper is structured as follows. Section 2 presents related work. Section 3 describes the CCIndex architecture design, including the construction of the index. Section 4 presents query processing and optimization. Section 5 gives the fault tolerant mechanisms. Section 6 gives detailed evaluations. Section 7 concludes the paper.

2 Related Work Recently, some research focuses on index mechanisms over DOT. Yahoo! focuses on optimizing range queries on DOT through adaptive parallelizing [3], and multidimensional range queries are done through range scan over primary key with predicates. This approach is very ineffective with low selectivity queries. Google and Yahoo! claim the future work on secondary index over DOTs [4, 2]. A currently available secondary index over DOT is the IndexedTable mechanism in Apache Hbase [5]. IndexedTable creates a new table for each index column, saves it in the DOTs in the order of the index value. IndexedTable is more effective for low selectivity queries than table scan, and has acceptable space overhead and fault tolerance ability. However, index scan needs random reads on original table which is very slow. Traverse [6] builds B-tree [7, 8] index for the map-reduce-merge system. Traverse has

CCIndex: A Complemental Clustering Index on Distributed Ordered Tables

249

the same performance problem as IndexedTable, and is lack of reliability due to the non-replica B-tree indexes. CCIndex is better than these approaches in space overhead, reliability, and index scan performance. Multi-dimensional range queries in databases are topics gaining attentions for more than 20 years. R-tree [9], R+-tree [10], and their successors extend B-tree, divide the multi-dimensional space, and store the recursively divided spaces as tree nodes. Queries walk through the tree to find out the data block. These schemes does not consider the reliability problem over DOT, unless they are implemented in a scalable and reliable way, just like the distributed B-tree [11]. Even though, the performance degradation also exists due to missing clustered data. DB2 introduces multi-dimensional clustering [12] to form every unique combination of dimension values as a logical ‘cell’, which is physically organized as block of pages. Multiple B-tree indexes are built for every dimension and the B-tree leaves point to the block. This scheme can avoid random read only when the values in block are dense, and the reliability of the B-tree index is not considered. Parallel databases [13, 14] support multi-dimensional queries and have good reliability, such as the MySQL cluster [15]. CCIndex is designed for more scalable DOTs to get good performance with large dataset and many machines. DHTs [16, 17, 18, 19] are scalable and reliable for key-value pair storage. Because the data is partitioned by hashing functions, DHT systems do not support range queries naturally. MAAN [20] and SWORD [21] use locality preserving hashing and store attributes in DHT as index to support range queries. However, the logN hop latency is not good for user-interactive applications.

3 Data Layout and Management This section introduces the CCIndex ideas and the underlying data layout. 3.1 Basic Idea The CCIndex is inspired through these observations: (1) There are usually 3 to 5 replica in the DOT systems to assure reliability and improve performance. (2) The indexes number is usually less than 5. (3) The random reads is significantly slower than scan. The trick of CCIndex is reorganizing the data to a new layout to accelerate multi-dimensional range queries. CCIndex introduces several Complemental Clustering Index Tables, each for a search column with the full row data, to convert the slow random reads to fast range scan. With multiple tables, a key decision is determining which table is chosen to scan. CCIndex leverages the region-to-server mapping information to estimate the result size of each sub queries. CCIndex disables the underlying data replica mechanisms to get an acceptable storage overhead, and uses these Complemental Clustering Index Tables to recovery each other to assure reliability. CCIndex creates a replicated Complemental Check Table for each search column to support incremental data recovery. 3.2 Data Layout In DOT systems, tables are very tall and logically ordered by row keys. Physically, each table is partitioned to regions containing continuous ranges, and each region has

250

Y. Zou et al.

several replicas identical to each other for fault tolerance. CCIndex reorganizes the underlying data layout as in Fig. 1.

Fig. 1. Data layout of CCIndex. For a logical table has id as the primary key with two index columns idx1 and idx2, CCIndex creates another two CCIT tables each for a index column and reorganizes the rows in the order of this column. CCIndex creates replicated CCTs for fast data recovery.

In the data layout, each logical table has several complemental tables. In Fig. 1, there is a table to support range queries over id and two index columns idx1 and idx2. The table CCIT0 is the original table ordered by unique id. CCIT1 and CCIT2 are ordered by key1 and key2, which are generated by concatenating index column value, the original id, and the index column value length. The construction of the new key makes sure the new CCITs are ordered by index column values, and makes the duplicated values of index columns be unique keys. The index value length field makes it easy to split the index value and id. With these CCITs, range queries over id or index columns can be a scan on the corresponding CCITs. Each CCIT has a corresponding replicated CCT, which contains the primary key and index columns of the CCIT. CCTs are necessary to help incremental data recovery of CCITs. The CCITs have no replica but the CCTs have replicas. Fig. 1 shows the logical view of CCIndex, and these tables are physically stored in DOT system. Storing CCITs and CCTs in DOT leverages the primary key ordering, data partition, and various operation optimizations to simplify CCIndex implementation. 3.3 Index Create, Update, and Delete The index maintenance is done along with the record insert and delete operations.

CCIndex: A Complemental Clustering Index on Distributed Ordered Tables

251

When CCIndex creates a table with specified index columns, all CCITs are created. When a record is written to DOT, the CCIndex first reads the original table to check whether the index column values are changed, and delete the corresponding records in CCITs when necessary. After that, the CCIndex writes the records to all CCITs in a parallel way. The delete operation also involves all the CCITs.

4 Query Processing and Optimization The DOT read and scan operation are simply redirecting to the original CCIT. The index scan is processed by CCIndex. 4.1 Query Plan Generation and Execution CCIndex introduces a SQL-like syntax to expression multi-dimensional range queries. The query string is like this: select rowkey, host, service, time, status from MonitoringData where host=’node 216’ and service=’CPU Load’ and (time > 1260610511 and time < 1260610521) CCIndex translates the SQL expression to a query plan tree, optimizes the tree, and translates the tree to disjunctive form. Then CCIndex executes the logic OR part in parallel, and executes each internal AND part by the estimated optimal one query with predicates of other columns to filter rows. 4.2 Query Plan Optimization CCIndex first does simple optimization of query plan tree to eliminate redundant range queries. For example, the time > 123 and time > 135 could be merged into time > 135. Furthermore, the important optimization is estimating result size of multiple AND queries and choosing the minimal one. In databases, query optimization is based on statistics of tables. However, DOT systems are lack of statistics, because the statistics are very difficult to gather and maintain in massive scale tables maintained by thousands of region servers. For example, there is not any statistics in HBase, and an additional tool must be written to count table rows. CCIndex introduces a way to estimate the query result size in the absence of statistics. CCIndex’s estimation method relies on the region-to-server mapping information of DOTs. The mapping information is necessary for DOT systems to record the responsible region server for each region. DHTs have no such information, because the mapping relationship is deduced by the overlay topology, object ids, and node ids. The mapping information can be abstracted in the form of . The regionStartKey is the minimal key in this region and serves as the region id. The mapping information is gathered together and ordered by the regionStartKey. CCIndex scans all this mapping information using a binary search, and finds out the number of covered regions for each range query. CCIndex claims that

252

Y. Zou et al.

the region number determines the result size, because for a DOT containing more than one region and having 64 MB default region size, each region size must be between 32 MB and 64 MB. For the first or last region not fully covered in a range, the coverage ratio is calculated to estimate the result size for the regions. This policy is more accurate for large query result size, because the average region size is more accurate when there are lots of regions. For query covering few of regions, the detailed size is not important, because this result size is small and the estimation objective is determining a query with small result size to execute.

5 Fault Tolerance In CCIndex, CCITs have no replica to avoid huge storage overhead, and cause the problem of fault tolerance. The basic idea is that CCITs replicate and recovery each other in record level. However, when a region of a CCIT is damaged, we can only reconstruct the whole CCIT if there is lack of ways to gather necessary records to recovery the region. CCIndex introduces the CCTs to help recoverying the damaged region. CCIndex only checks the corresponding CCTs to get the proper keys for CCITs and get the record data to rebuild the region. The CCTs imposes additional overhead on inserting or deleting records. CCIndex maintains the CCTs in an asynchronous way to minimize the overhead. CCIndex leverages the log of DOTs to update CCTs in batch mode by the background threads.

6 Implementation and Evaluations This paper implements a CCIndex prototype and evaluates CCIndex through theoretical analysis, micro benchmarks, and synthetic application benchmarks. 6.1 Implementation This paper builds a CCIndex prototype based on Apache Hbase, an open-source implementation of BigTable. Hbase is a sub-project of Apache Hadoop [22], which has HDFS as the distributed file system and MapReduce as the parallel computing model. HBase builds on top of HDFS, has one master process called HMaster and many slave processes called HRegionServer to manage data regions. The CCIndex prototype uses HBase v0.20.1 as code base, adds clustering index table package, and implements CCIndex in Java. CCIndex disables the replica of HDFS by setting the replica factor to one, and creates one CCIT for each search column. CCIndex builds several CCTs on replicated HFDS files to achieve reliability. HBase has multiple META regions, and each META region contains mapping of a number of user regions comprising the tables to HRegionServers. HBase has a ROOT region to locate all the META regions. CCIndex scans the ROOT and META regions to get the region-to-server mapping information and estimates the query result size.

CCIndex: A Complemental Clustering Index on Distributed Ordered Tables

253

The comparable IndexedTable is a built-in index mechanism provided in HBase. IndexedTable creates a replicated ordered table for each index column and is an implementation of secondary index scheme. IndexedTable does not provide multidimensional range queries interface or optimization to estimate query result size for multi-dimensional range queries. 6.2 Theoretical Analysis For the three metrics performance, space overhead, and reliability, the first one is easy to evaluate through experiments, while the other two are more suitable to do theoretical analysis to get more insight. Theorem 1. The space overhead ratio of CCIndex to IndexedTable is (N*N+1)/(2*N+(N+1)*L/Ln) Where N is the number of index columns without primary key, and L/Ln is factor that total record length divided by the sum of index column lengths and key length, with the suppose that the replica factor for record data is N+1, and index column and primary key has the same length. Proof. In IndexTable, the space for each record is the original table plus index: Sii=(Lk+Li)*N*F+L*F

(1)

Where N is the number of index columns, F is the replica number, Lk is the length of key, Li is the average length of index columns, and L is the total length of a record. In CCIndex, the space for each record is the CCITs plus CCTs. The space for CCTs is: Sc = (Lk+N*Li)*N*F + (Lk+N*Li)*F = (Lk+N*Li)*(N+1)*F

(2)

The total space for CCIndex is: Scc=Sc+L*(N+1)

(3)

If Lk = Li, F = N + 1, the space overhead ratio of CCIndex to IndexedTable is: (Scc-Sii)/Sii = (N*N+1) / (2*N+L/Lk)

(4)

Let Ln=Lk + N*Li, then the formula (4) is: (Scc-Sii)/Sii = (N*N+1)/(2*N+(N+1)*L/Ln)

(5)

□

The equation (5) in theorem 1 can be plotted as Fig. 2. From Fig. 2, the overhead ratio drop significantly as the L/Ln increases and the N decreases, which indicates that CCIndex should have less columns to index and all index columns should have small length to avoid big space overhead. If N changes from 2 to 4 and the L/Ln changes from 10 to 30, then the overhead changes from 5.3% to 29.3%.

254

Y. Zou et al.

Fig. 2. The space overhead ratio of CCIndex to IndexedTable. The overhead ratio drops significantly as the L/Ln increases and the N decreases. If N changes from 2 to 4 and the L/Ln changes from 10 to 30, then the overhead changes from 5.3% to 29.3%.

Theorem 2. In CCIndex, the probability of being able to recovery a damaged record is (1 - f(N+1))2 Where f is the probability of a record damages, N is the indexed column number. The probability is the same as that of IndexedTable. Proof. CCIndex recoveries the data through CCTs and CCITs. CCTs have N replicas plus another copy in the corresponding columns in CCIT and the probability of failing to read from all CCTs is f(N+1). The probability of replicas for a given record in all CCITs are damaged is f(N+1). So, the probability of being able to recovery a damaged record is (1 - f(N+1))2. For IndexedTable, data access relies on replicated index and the original table. The probability is obviously the same as CCIndex.

□

6.3 Micro Benchmarks BigTable introduces a micro benchmark to evaluate the basic operations throughput, including random read/write, sequential read/write, and scan. The workload is comprised of a table with 1KB rows, and each row has an additional 10 bytes rowkey. The throughput is defined as rows per seconds for all clients. HBase implements this micro benchmark and has single thread client, multi-threads clients, or MapReduce clients to evaluate the throughput. CCIndex extends the micro benchmark by adding an “index” column family to contain three columns and each is 10 bytes, and building three indexes using these columns. CCIndex adds an IndexScan operation to scan through the first column index. We setup an experimental environment having two clusters. The small cluster has 3 nodes for micro benchmarks, and the big one has 16 nodes for synthetic application

CCIndex: A Complemental Clustering Index on Distributed Ordered Tables

255

benchmark. Each node has two 1.8 GHz dual-cores AMD Opteron (tm) Processor 270, 6 GB memory. Each node in the small cluster has 321 GB RAID5 SCSI disks, and each node in the big cluster has 186GB RAID1 SCSI disks. All nodes in each cluster are connected by Gigabits Ethernet. Each node uses Red Hat CentOS release 5.3 (kernel 2.6.18), ext3 file system, Sun JDK1.6.0_14, Hadoop v0.20.1, and HBase 0.20.1. The HBase itself uses 3 GB heap memory. In our experiments, we choose the workloads which have 1 million rows, and run each tests three times to report the average value. The client uses one of the 3 machines with three concurrent threads. The micro benchmarks use 3 machines.

Fig. 3. Basic Operation Performance of Two Index Schemes. CCIndex throughput is 11.4 times of IndexTable’s in IndexScan operation. CCIndex random write and sequential write operations is 54.9% and 121.4% better than that of IndexTable.

We compare CCIndex with IndexedTable in HBase and show results in Fig. 3. CCIndex’s IndexScan operation throughput is 11.4 times of IndexTable’s, which shows the benefits of CCIndex through avoiding random reads in primary key. CCIndex random write and sequential write operations are 54.9% and 121.4% better than those of IndexTable, which is due to the parallel index updating. The scan, random read, and sequential read of these two schemes are nearly identical due to the same logic path. We compare the throughput of CCIndex with the original table, and the result is in the Fig. 4. The IndexScan is unavailable for origin table without index. CCIndex IndexScan throughput is 10.9% more than origin table, which is due to the first column in the “index” family is moved to the row key, so the data length of CCIndex table is smaller than original table’s. Fig. 4 further interprets why CCIndex can gain an order of magnitude improvement over IndexedTable. For IndexedTable, a range query over an index column should first scan the index table, and then issue multiple random reads in the original table to get the row data. In IndexedTable, throughput of scan over index table is nearly the same as scanning the original table. While in CCIndex, range query over an index column is done by IndexScan, which scans over

256

Y. Zou et al.

corresponding CCIT. The IndexScan throughput for CCIndex is 8.2 times of random read in original table, and 1.1 times of scan in original table, so the throughput is at least 9.3 times over IndexedTable, because IndexedTable needs additional time to parse and wrap intermediate results.

Fig. 4. Basic Operation Performance with Original Table. CCIndex IndexScan throughput is 10.9% more than origin table. The random write and sequential write is significantly lower than the origin table due to the overhead to maintain index, which is a common issues for both index schemes.

The throughput of random write and sequential write for CCIndex is significantly lower than the origin table, because maintaining index needs another random read to get row data for checking whether to change index column value, and a further delete and write to update index if is necessary.

Fig. 5. Index scan latency of all three schemes. CCIndex is 9.2 times faster than IndexedTable when the result count is larger than 1024.

CCIndex: A Complemental Clustering Index on Distributed Ordered Tables

257

Because scanning index to get the matching row data is the most important functions of building index, we claim that CCIndex significantly outperforms IndexedTable and is suitable for range queries over indexes. However, we should carefully choose the workloads having less write operations and choose more stable index columns to avoid the performance degradation of write, and these are general guidelines for all two index schemes. The following experiments show the index scan latency of all the three schemes in different result count. We use the scan over primary key to represents the unavailable index scan of the origin table. The results are illustrated in Fig. 5. From Fig. 5, the CCIndex latency is significantly smaller than IndexedTable, and the ratio is stable at 9.2 when the result count is larger than 1024. Another interesting thing is that the absolute latency of CCIndex is low, and the round-trip latency to get 1024 1KB continuous rows is 42 micro-seconds. The low latency of CCIndex over HBase shows the ability of serving high user-interactivity applications, such as blog, wiki, and twitter. 6.4 Synthetic Application Benchmarks Multi-dimensional range queries are not directly supported by IndexedTable, so we designed a suite of experiments to compare the performance with the memory-based parallel database MySQL cluster. There are no common accepted benchmarks for multi-dimensional range queries over DOTs yet. The well-known benchmarks for database, such as TPC-C [23] and TPC-H [24], have a majority of operations not supported by DOT, such as transactions over multiple records and complex queries with joins and aggregations. This paper designs a synthetic application benchmark by analyzing a well-known cluster monitoring application Nagios [25]. Nagios supports comprehensive monitoring of operating systems, applications, network protocols, system metrics, and network infrastructure through user-configured monitoring items, called “service”, in a fixed interval on all hosts in a cluster. Nagios records the information about launching a monitoring item on a host into the log, including timestamp, host, service, execution time, and the response message for this monitoring item, etc. Nagios provides a web portal contacting backend CGI programs to read monitoring data and show various aspects of the cluster. The log information volume is exposing if we have more monitor items, more hosts, shorter interval, and a longer period of information to store. Through analyzing the application logic of the Nagios web portal, we construct a table ServiceTime using host concatenating service and time as the primary key, and with service and time as the record. We design two queries for our tests. z

z

AndQuery: Multi-dimensional range queries with AND operations results in a big result count. The query likes “select * from ServiceTime where (primaryKey > K1 and primaryKey < K2) and (time > k3 and time < k4) and (service = ‘CPU Load’)”. The query runs with multiple clients concurrently, each with different ranges to get load balance. The result count for each client is about 5 million. OrQuery: Multi-dimensional range queries with OR operations results in a big result count. The query is similar as AndQuery, but uses OR to connect different dimensions. The result count for each client is about 10 million.

258

Y. Zou et al.

These queries should be run in multiple clients to get the total throughput of all clients. We use the 16 node cluster described in the micro benchmarks, and there is totally 64 cores and 96 GB memory. In our experiments, we collect more than 120 million monitoring records with average record length 118 bytes. The MySQL cluster is version 7.09, which is configured with 1 management node, 2 SQL nodes, and 14 data nodes. In this test the maximum data node number is 14 because data nodes must not co-located with management node and must be even number. The HBase regionserver in each node has 3GB heap memory. In the following tests, we use at most 90 million records because it reaches the capacity limits of our configured MySQL cluster. We allocate 3 GB as the data memory for each MySQL data node. MySQL cluster stores all records in data memory and cannot accept new records when the memory is all consumed. In the tests, each node runs an instance of client. Fig. 6 shows the results.

Fig. 6. Multi-dimensional range queries throughput for multiple clients. MySQL Cluster performance is stable when the data sets increases from 30 million to 90 million records. With the 90 million records, CCIndex AndQuery and OrQuery throughput is 2.1 and 1.9 times of the memory-based parallel database MySQL Cluster.

CCIndex AndQuery and OrQuery throughput is 2.1 and 1.9 times over MySQL Cluster with 90 million records dataset, which shows CCIndex performance is significant better than MySQL Cluster. In Fig. 6, the MySQL Cluster performance is stable for AND and OR queries when the data sets increase from 30 million to 90 million records. However, the MySQL Cluster scalability problem is that the capacity is determined by the total memory for data, because all the data has a copy in memory, which improves the performance but limits the capacity.

CCIndex: A Complemental Clustering Index on Distributed Ordered Tables

259

6.5 Discussion The CCIndex can be applied to DOT systems with few of columns to index, which has great impact on the storage overhead. For a table more than 5 columns having query requirements, the practical solution is identifying the most frequently used columns to build index with CCIndex, or combines some columns to reduce the column number. CCIndex practically does not support adding or removing index after the table is created, for the reason that creation or deleting of CCIT costs unaffordable time for massive scale data. Another problem is that CCIndex write operation is slower than the original table. These two are common problems for many index schemes. In CCIndex, the probability of being able to recovery a damaged record is fairly good; however, the data recovery time is longer than IndexedTable scheme. Because in CCIndex, recovery a region needs gathering all records by random read in other CCITs, which is slower than copying a 64 MB region data file.

7 Conclusions and Future Work This paper models the massive scale databases as Distributed Ordered Table, or DOT, which partitions continuous keys to regions, replicates regions for performance and reliability, distributes regions to shared-nothing region servers for scalability, serves as tables and columns, and supports range queries on keys. This paper formulates the problem as supporting multi-dimensional range queries over DOT while considering the three metrics: high performance, low space overhead, and high reliability. This paper proposes a scheme called CCIndex, short for Complemental Clustering Index, to tackle this problem. CCIndex introduces Complemental Clustering Index Tables each for a search column with the full row data to reorganize data and improve query performance. CCIndex leverages the region-to-server mapping information to estimate the result size of each query without statistics. CCIndex disables the underlying data replica mechanisms to avoid too much storage overhead, and introduces replicated Complemental Check Table to support incremental data recovery. CCIndex prototype has been built on Apache HBase. Theoretical analysis shows that CCIndex consumes 5.3 ~ 29.3% storage more than secondary index scheme in HBase for typical situations and the probability of failing to recovery bad rows is the same as secondary index scheme. Micro benchmarks show that CCIndex throughput of range queries on non-key column is about 11.4 times of secondary index. The synthetic monitoring application range queries in a 16-node cluster shows that CCIndex AndQuery and OrQuery throughput is 2.1 and 1.9 times over MySQL Cluster with 90 million records dataset. The future work includes further optimization and evaluation the space overhead and reliability in terms of recovery time. Additional work should be done to optimize the index updating performance. Some real world application benchmarks should be involved to evaluate the query performance in real world scenarios. More practical experiences and lessons should be given.

260

Y. Zou et al.

Acknowledgment We would like to thank Vega GOS R&D team members and GOS users, especially Liang Li to help optimizing the query plan. This work is supported in part by the Hi-Tech Research and Development (863) Program of China (Grant No. 2006AA01A106, 2006AA01Z121, 2009AA01A130), and the National Basic Research (973) Program of China (Grant No. 2005CB321807).

References 1. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. In: 7th USENIX Symposium on Operating Systems Design and Implementation, vol. 7, pp. 205–218. USENIX Association, Berkeley (2006) 2. Cooper, B.F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen, H.-A., Puz, N., Weaver, D., Yerneni, R.: PNUTS: Yahoo!’s hosted data serving platform. In: Proc. VLDB Endow. vol. 1, pp. 1277–1288. VLDB Endowment (2008) 3. Vigfusson, Y., Silberstein, A., Cooper, B.F., Fonseca, R.: Adaptively parallelizing distributed range queries. In: Proc. VLDB Endow., vol. 2, pp. 682–693. VLDB Endowment (2009) 4. Cafarella, M., Chang, E., Fikes, A., Halevy, A., Hsieh, W., Lerner, A., Madhavan, J., Muthukrishnan, S.: Data management projects at Google. SIGMOD Rec. 37(1), 34–38 (2008) 5. Apache Hbase project, http://hadoop.apache.org/hbase 6. Yang, H.C., Parker, D.S.: Traverse: Simplified Indexing on Large Map-Reduce-Merge Clusters. In: Zhou, X., Yokota, H., Deng, K. (eds.) DASFAA 2009. LNCS, vol. 5463, pp. 308–322. Springer, Heidelberg (2009) 7. Bayer, R., McCreight, E.: Organization and maintenance of large ordered indexes. Acta Informatica 1(3), 173–189 (1972) 8. Comer, D.: Ubiquitous B-Tree. ACM Computing Surveys (CSUR) 11(2), 121–137 (1979) 9. Guttman, A.: R-Trees: A Dynamic Index Structure for Spatial Searching. In: Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, vol. 13, pp. 47–57. ACM, New York (1984) 10. Sellis, T.K., Roussopoulos, N., Faloutsos, C.: The R + -tree: A dynamic index for multidimensional objects. In: Proceedings of the 13th International Conference on Very Large Data Bases, vol. 13, pp. 507–518. Morgan Kaufmann, San Francisco (1987) 11. MacCormick, J., Murphy, N., Najork, M., Thekkath, C.A., Zhou, L.D.: Boxwood: abstractions as the foundation for storage infrastructure. In: Proceedings of the 6th USENIX on Symposium on Operating Systems Design and Implementation, vol. 6, pp. 105–120. USENIX Association, Berkeley (2004) 12. Padmandabhan, S., Bhattacharjee, B., Malkemus, T., Cranston, L., Huras, M.: Multidimensional clustering: a new data layout scheme in DB2. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, vol. 32, pp. 637–641. ACM, New York (2003) 13. DeWitt, D.J., Gerber, R.H., Graefe, G., Heytens, M.L., Kumar, K.B., Muralikrishna, M.: GAMMA - A High Performance Dataflow Database Machine. In: Proceedings of the 12th International Conference on Very Large Data Bases, vol. 12, pp. 228–237. Morgan Kaufmann, San Francisco (1986)

CCIndex: A Complemental Clustering Index on Distributed Ordered Tables

261

14. Fushimi, S., Kitsuregawa, M., Tanaka, H.: An Overview of The System Software of A Parallel Relational Database Machine GRACE. In: Proceedings of the 12th International Conference on Very Large Data Bases, vol. 12, pp. 209–219. Morgan Kaufmann, San Francisco (1986) 15. .Ronström, M., Oreland, J.: Recovery principles of MySQL Cluster 5.1. In: Proceedings of the 31st International Conference on Very Large Data Bases, vol. 31, pp. 1108–1115. VLDB Endowment ( 2005) 16. Stoica, I., Morris, R., Karger, D., Kaashoek, F., Balakrishnan, H.: Chord: A scalable peerto-peer lookup service for internet applications. In: Proceedings of the 2001?Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pp. 149–160. ACM, New York (2001) 17. Sylvia, R., Francis, P., Handley, M., Karp, R., Schenker, S.: A scalable contentaddressable network. In: Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pp. 161–172. ACM, New York (2001) 18. Rowstron, A., Druschel, P.: Pastry: Scalable, Decentralized Object Location and Routing for Large-Scale Peer-to-Peer Systems. In: Guerraoui, R. (ed.) Middleware 2001. LNCS, vol. 2218, pp. 329–350. Springer, Heidelberg (2001) 19. Zhao, B.Y., Kubiatowicz, J.D., Joseph, A.D.: Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing. Technical report, University of California at Berkeley (2001) 20. Cai, M., Frank, M., Chen, J., Szekely, P.: MAAN: A Multi-Attribute Addressable Network for Grid Information Services. Journal of Grid Computing 2, 3–14 (2004) 21. Albrecht, J., Oppenheimer, D., Vahdat, A., Patterson, D.A.: Design and implementation trade-offs for wide-area resource discovery. ACM Trans. Internet Technol. 8(4), 1–44 (2008) 22. Apache Hadoop project, http://hadoop.apache.org 23. TPC Benchmark C, http://www.tpc.org/tpcc 24. TPC Benchmark H, http://www.tpc.org/tpch 25. Nagios project, http://www.nagios.org

Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems Wei Zhou1, Jianfeng Zhan1, Dan Meng1, and Zhihong Zhang2 1

Institute of Computing Technology, Chinese Academy of Sciences 2 The Research Institution of China Mobile [email protected]

Abstract. It has been long recognized that failure events are correlated, not independent. Previous research efforts have shown the correlation analysis of system logs is helpful to resource allocation, job scheduling and proactive management. However, previous log analysis methods analyze the history logs offline. They fail to capture the dynamic change of system errors and failures. In this paper, we purpose an online log analysis approach to mine event correlations in system logs of large-scale cluster systems. Our contributions are three-fold: first, we analyze the event correlations of system logs of a 260-nodes production Hadoop cluster system, and the result shows that the correlation rules of logs change dramatically in different periods; Second, we present a online log analysis algorithm Apriori-SO; third, based on the online event correlations mining, we present an online event prediction method that can predict diversities of failure events with the great detail. The experiment result of a 260-nodes production Hadoop cluster system shows that our online log analysis algorithm can analyze the log streams to obtain event correlation rules in soft real time, and our online event prediction method can achieve higher precision rate and recall rate than the offline log analysis approach. Keywords: System logs, online log analysis, event correlations, online event prediction.

1 Introduction As the scale of cluster systems grows in the area of scientific computing and commercial applications, failures [7] become normal, and their root causes are diversely derived from software, hardware, maintenance (typically by the vendor), operations (management of the system), environment (power, facilities, command lines), and the infrastructure that supports software distribution, and project management [1]. Collected by systems, applications, and tools, logs that record important failure events are the first place system administrators go for troubleshooting when they are alerted to a problem. The examples of logs include /dev/error in Linux systems, data collected by tools like OpenView, IBM Tivoli, Microsoft Operations Manager and NetLogger [2]. It has been long recognized that failure events are correlated, not independent. For example, the work published in the year of 1992 [19] has concluded the impact of C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 262–276, 2010. © IFIP International Federation for Information Processing 2010

Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems

263

correlated failures on dependability is significant. Previous research efforts [3] [4] [5] [6] have shown the correlation analysis of system logs is promising in event prediction and fault diagnosis, and thus it is helpful to resource allocation, job scheduling and system management [3] [5] [9]. Recent work shows the analysis of logs is useful in mining dependency of distributed system [16][17] or model IT service availability [18]. Most of previous log analysis methods are offline. The offline log analysis methods collect log streams of a long period, for example one or even three months, offline preprocess logs for mining event correlation or filter events, and then use the analysis result to predict failures or diagnose faults [20]. The offline log analysis methods have three drawbacks: first, it is difficult to provide online service, for example event prediction, to other runtime service such as a job scheduling system. Second, the previous work of Adam Oliner et al [20] shows that over the course of a system’s lifetime, anything from software upgrades to minor configuration changes can dramatically alter the meaning or character of the logs, while offline log analysis methods are weak in capturing the dynamic of failures. Third, offline tools do not provide the ability to automatically react to problems [21], however, system administrators or autonomic management systems need to deal with the outages in time. In this paper, we focus on the online log analysis and event prediction, based on the event correlations. When we refer to an online log analysis method, we indicate three-fold meanings: first, our method can analyze incoming system logs (log stream) in a soft real time; second, almost immediately after an event of cluster system occurs, our method will mine event correlations in almost real time; third, other systems will use mining results for different purposes in time. In this paper, we will show how to use mining results for event prediction. We also plan to use these results for fault diagnosis or other promising purposes. The online log analysis of large-scale cluster systems raises several challenges [15]. First, because the meaning or character of the logs changes over the course of the lifetime of a cluster system [20], the analysis algorithm should be suitable for capturing the dynamic nature of logs or failures. Second, the analysis results should be almost real time, accurate and complete for diversities of online systems, for example event prediction, fault diagnoses, job scheduling or checkpoint system etc, and hence other systems can use the online analysis results. As shown in Fig.1, we treat a continuous time flow as several overlapping time frames; our system generates events rules that capture the event correlations of logs in different time frames and then updates the event rule database that collects event rules mined in the whole log history. For example, as shown in Fig.1, at the end of the Ith time frame, our system will invoke the job of mining event correlations in the logs of the Ith time frame, generate new event rules that capture the event correlations of logs in the Ith time frame, and then update the event rule database. In our system, we let the Ith time frame overlaps the (I+1)th time frame. This choice is because that we need to mine the correlations of events in adjacent time frames, and if two adjacent time frames are disjoint, some event correlations will be ignored.

264

W. Zhou et al. Invoke the job of mining event correlations in the logs of Ith time frame

Invoke the job of mining event correlations in the logs of (I+1)th time frame

updating event rule database Ith time frame

updating event rule database

Time flow (I+1)th time frame

Analysis time for generating new event rules

Analysis time for generating new event rules

Fig. 1. Basic idea of our online log analysis system

Our contributions are three-fold. First, we analyze the correlations of logs of a 260nodes production Hadoop cluster system, and the analysis result shows that the correlation rules of logs change dramatically in different periods; Second, we design and implement the first online log analysis algorithm Apriori-SO that supports the online log filtering and event correlations mining; third, based on the online event correlations mining, we present an online event prediction method that can predict diversities of failure events with the great detail. We use an emulation methodology to analyze and predict the event logs of a 260-nodes production Hadoop cluster systems in the Research Institution of China Mobile. The experiment result shows that our online log analysis system can analyze the log streams to obtain event correlation rule in soft real time and our online event prediction system can achieve more precision rate and recall rate than our previous offline log analysis system. The paper is organized as follows: In Section 2, we describe the related work. Section 3 justifies our motivation of the online log analysis. Section 4 presents the design and implementation of our online event correlation mining and online event prediction systems. The evaluation of the system is summarized in Section 5. We draw a conclusion and discuss the future work in Section 6.

2 Related Work Most of traditional log analysis methods are offline. Some work uses statistical analysis approach to find simple temporal and spatial laws or models of system events [6] [5] [10] [11] in large-scale cluster systems like BlueGene/L. When the obtained knowledge is used in event prediction, it may bring high precision rate and recall rate, however it is compared with the filtered logs obtained with the aggressive filtering policy. For example, in the work of [10], 99.96% of original logs are filtered. With

Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems

265

the aggressive filtering policy, the important failure patterns [22] or warning messages [10] , which are often the symptom of fatal errors, may be ignored. Besides, the predicted events are coarse without the detail. The work of [3] applied time-series algorithms, rule-based classification techniques, and Bayesian network model to assess the effectiveness of these techniques in predicting failure in a cluster. However it either focuses on specific types of failures or targets small scale systems, thus not sufficient for large-scale clusters [6]. Rule-based algorithms are used in some papers [3] [6] [8] [12]. The work of [6] presents a meta-learning method based on statistical analysis and standard association rule algorithm. The rule-based algorithms only consider the correlations between two event types. If we consider the correlations across multiple event types, the precision of event prediction will improve. Besides, they did not consider the dynamic change of failure correlations over time.

3 Motivation: Why Online Log Analysis Is Necessary? To analyze the log of large-scale cluster systems, in our previous work, we have developed an offline log analysis system named LogMaster to mine the event correlations of logs. In this section, we present some offline log analysis results to justify our motivation of online log analysis. 3.1 The Description of the Hadoop System We used LogMaster to analyze and predict the logs of a production Hadoop cluster system in the Research Institution of China Mobile, which is the largest telecom operator in the world. The production cluster system is used to run a series of dataintensive applications based on Hadoop [15]. The system has 260 nodes, including 10 servers and 250 data nodes. Data nodes are used to run Hadoop applications, while servers are used to analyze logs or manage system. Inspired by Google's MapReduce and Google File System (GFS), Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. 3.2 The Introduction of Offline Log Analysis The detail of the offline log analysis method can be found in our previous work [13]. To save space, in this section, we only give the short description of concepts and methods.

266

W. Zhou et al.

Event preprocessing Because of the different sources of event logs, the logs have different formats: text files, databases, or special file formats generated by programs. The log preprocessing step parses the variant log files into a nine-tuples (timestamp, log id, node id, event id, severity, event type, application name, process id, user). The severity degrees include INFO, WARNING, ERROR, FAILURE, FAULT, and the event types include HARDWARE, SYSTEM, APPLICATION, FILESYSTEM, NETWORK, etc. The attributions of timestamp, node id, application name, process id and user are easily obtained. The event id and log id are respectively the mapping functions of 2-tuples (severity, type) and 4-tuple (node id, event id, application, process id). For an upcoming event, the event id is generated according to 2-tuples (severity, type). If a new 2tuples (severity, type) is reported, a new event id will be assigned to this event. If a new 4-tuple (node id, event id, application, process id) is reported, a new log id will be assigned to this event. Event correlations mining For an event pair (A, B), of which event A occurs before event B within a predefined sliding time windows of the log buffer (in short sliding time window), we call event A is event B’s preceding event, and event B is event A’s posterior event. The support count is the recurring times of the preceding event which is followed by the posterior event, while the posterior count is the recurring times of the posterior event which follows the preceding event. For example, for an event sequence ACBBA, the support count of (A, B) is one and the posterior count is two. We define the confidence of (A, B) as follows: Confidence (A, B) = support count (A, B) / count (A). If an event pair occurs within the predefined time window and the support count and confidence of the event pair exceeds the predefined support count threshold and confidence threshold, we call it an event rule. The time relation of offline log analysis is shown in Fig.2. The Apriori-S algorithm scans the whole log history to get the 2-items event rules using statistical analysis, and generates candidate k-items rules (k>2) based on (k-1)-items rules, then scans the whole log history to validate the candidate k-items rules and get k-items rules.

Fig. 2. Time relation in the offline failure analysis approach

Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems

267

We give an example of 3-item event rules (911->913->985) here. 911, 913 and 985 are three different log id, their details are as follows: …… Nov 28 13:42:13 compute-5-4.local rpc.statd[2579]: Caught signal 15, unregistering and exiting. Nov 28 13:44:17 compute-5-4.local sshd[1653]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use. Nov 28 13:45:33 compute-5-12.local sshd[1655]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use. Event prediction Based on the event rules, a failure predictor can help determine possible occurrences of important events in the near future. As shown in Fig.2, the system begins predicting failures at the timing of the predicting point. The occurring time of predicted failure is called the predicted point. The prediction time is the time difference between the predicting point and the predicted point. The prediction time is the time span left for the autonomic system or system administrator to respond with possible upcoming failures. When the system predicts failures at the predicting point, the events having occurred at the prediction time window will be used to predict events. When using our system to predict failures, the system administrator can predefine the prediction valid duration that is the time difference between predicting point and the expiration point. If the predicted event occurs within the prediction valid duration, we consider it valid, else we consider it invalid. Our system can predict failure events with great detail. For example, a predicted event includes the following information: predicted point, log id, node id, application name, event type, and event severity. 3.3 Experiment Results Our work analyzes three month’s logs of a 260-nodes production Hadoop cluster system in the Research Institution of China Mobile. The logs are collected between Oct 26, 2008 and Dec 31, 2008, which includes 977,858 original event entries. We use a server to analyze the logs. The server has two Intel Xeon Quad-Core E5405 2.0GHZ processors, 137GB disk, and 8G memory. We divide three months’ logs into three disjoint periods: (1) Period one: from Oct 26, 2008 to Nov 16, 2008; (2) Period two: from Nov 17, 2008 to Dec 08, 2008; (3) Period three: from Dec 09, 2008 to Dec 31, 2008. We independently mine event rules in different periods, and then we compare event rules mined in different periods. The comparison of event rules is shown in Table 1. From Table 1, we can observe that the event rules dramatically change in different periods. For example, there are only 12 same events rules occurring in both Period one and Period two. This has several possible causes: (1) the repair of the failure or the self-healing of systems or applications cause some failure events disappear; (2) changes of events in the system or applications; (3) new coming failure events.

268

W. Zhou et al.

Table 1. The comparison of event rules in different periods (log buffer =60 minutes, support count threshold =5, confidence threshold =0.5)

number of same event rules period 1 period 2 period 3

period 1 133

period 2 12 51

period 3 5 6 48

Finally, we use the event rules obtained from Period I to predict events in Period J. The experiment result is shown in Fig.3. From Fig.3, we can see that if we use event rules in a period to predict events in another period, the precision rate is low. However, if we use event rules in a period to predict events in the same period, the precision rate is higher. ) 100 % ( n 90 o i 80 t c i 70 d e 60 r p 50 f o 40 e t 30 a r 20 n 10 o i s 0 i c e r P

Predicted events in Period 1 Predicted events in Period 2 Predicted events in Period 3

Rules of Period 1

Rules of Period 2

Rules of Period 3

Fig. 3. The precision rate of event prediction using vent rules in different period. (Log buffer =60 minutes, prediction time window = 60 minutes, support count threshold =5, confidence threshold =0.5). The precision rate of event prediction is defined in Section 5.1.

Our observation is complementary to the previous work of Adam Oliner et al [20]. The previous work of Adam Oliner et al show that over the course of a system’s lifetime, anything from software upgrades to minor configuration changes can dramatically alter the meaning or character of the logs. Both observations justify our motivation of online log analysis.

4 Online Log Analysis and Event Prediction In this section, we introduce the online log analysis algorithm and the online event prediction method. In this section, we use the same concepts introduced in Section 3.2.

Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems

269

4.1 Online Log Analysis Different from Apriori-S algorithm described in our previous work [13], we use a sliding time window model as shown in Fig.4 so as to analyze the event log streams. The logs in the sliding time window are saved into the log buffer. When all the logs in the time windows are read into the log buffer, the online log analysis system will analyze the logs in the two adjacent log buffers to generate new event rules and then update the event rule base.

Fig. 4. Time relation in our online failure analysis approach

4.2 Apriori-SO Event Correlations Mining Algorithm For event logs in multiple nodes of cluster systems, we take the following facts into accounts: (1) the replicated applications in multiple nodes may have the same errors or software bugs, and failure events may appear in multiple nodes; (2) nodes in largescale computing systems transfer data and communicate with each other, so a failure on one node may cause related failures on other nodes; (3) a failure on one node may change the environment of the system, which may cause other failures on other nodes. So as to analyze the correlation of failure events among multiple nodes, we use a log filtering policy that only analyzes events occurring in the same node or having the same event types or of the same applications. It can effectively reduce the size of the analyzed logs and decrease the analysis time. On the basis of the Apriori associate rule mining algorithms [13], we propose an improved algorithm, named Apriori-SO, to get the frequent itemsets with the support count above the user-defined threshold value. The Apriori-SO algorithm is a one-pass algorithm, which is described as below: (1) Proper thresholds of support Sth and confidence Cth are predefined; the proper sliding time window Tb is defined too. (2) Suppose B is the current log buffer, and Bp is the preceding log buffer. Suppose C (k) means the set of frequent k-items event set candidates, F (k) means the set of frequent k-item event set, R (k) means the set of k-item event rules. Set R (2) = {};

270

W. Zhou et al.

(3) If a new event e comes, add e into the log buffer B. If all events in the current time window is read into the current log buffer, goto step (4); else loop step (3). (4) Scan the log buffer B and Bp, count the number of each event, and support count and posterior count of each event pair (i , j) while i and j are both in B,or i in Bp and j in B. (5) Calculate cumulative number of each event, and support and confidence of each event pair (i, j). Update the support and posterior of event rules in R (2), and add new event rules if the support and confidence of an event pair are above thresholds. (6) Get new frequent k-items (k>=3) event set candidates. If two adjacent subsets of a k-items event set are in F (k-1), add the k-items event set into C (k). For example, if (A,B) and (B,C) are frequent 2-items event set, then the 3-items event set (A,B,C) are frequent 3-items event set candidates and add it into C(3). (7) Scan the B and Bp to get the support count and posterior count of event pairs in C (k). We regard the log buffer B and Bp as the sampling of the log history, and calculate the approximate value of support count and posterior count of each event pair in C (k); (8) Add the k-items candidates in C (k) having support count above the threshold to F (k), and add the k-items in F (k) having confidence above the threshold to R (k); The 2-items event rules are generated in step (2) and step (3). In the step (2) and step (3), we only consider the event set that have the same node name or event type or of the same application name occurring in the time window Tb which is defined in step (1). This event filtering policy can reduce the amount of event logs effectively. Based on the 2-items rules, the k-items rules are generated in step (4). Different from the Apriori-S algorithm which get the support of the candidate rules by scanning the whole log history (Shown in Fig. 2), we get the approximation value of support count of k-items rules based on the support count of (k-1)-items rules. 4.3 Online Event Prediction The online event prediction is shown in Fig.5. The concepts of predicting point, predicted point, prediction valid duration, prediction time, and prediction time window are same like that of offline log analysis, which are explained in Section 3.2.

Fig. 5. Time relation of our online event prediction approach

Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems

271

As shown in Fig.5, the difference of the online event prediction method from our previous offline one is that at the end of each time frame, we will generate new event rules and update the event rule database. Finally, we use the updated event rules to predict upcoming events. The implementation of our online event prediction system is same like that in our previous work [13]. We find all the correlated events of coming log in the prediction time window. The current state of system is determined according to these events, and the upcoming correlated events are predicted based on the event rules.

5 Experiments The target Hadoop system and its system logs in our experiment are described in Section 3.2. In our experiment, we preprocess and analyze the logs between Oct 26, 2008 and Dec 31, 2008. The event rules generated according to the event logs are used to online predict the event logs in Jan, 2009. 5.1 Metrics We use the analysis time, the memory usage of the Log server node to evaluate the overhead of our online log analysis system, and compare the precision rate, the recall rate, and the average prediction time to evaluate our online event prediction system. (1) Average analysis time The compute complexity means the time and space utilization of algorithm, including time complexity and space complexity. In this experiment, we use the average analysis time and the average analysis time of time frames to evaluate the compute complexity. As shown in Fig.5, the analysis time is the time difference between the beginning and ending timing points of event preprocessing, filtering and correlations mining. Average analysis time = the total analysis time / count of event logs Average analysis time of time frames = the total analysis time / count of time frames (2) Precision rate and Recall rate The precision rate means the ratio of the correctly predicted events to all predicted events. The recall rate means the ratio of correctly predicted events to all forthcoming events. True Positive (TP) = the count of events which are correctly predicted False Positive (FP) = the count of events which are predicted but not appeared in the time window Precision rate = TP / (TP + FP) Recall rate = True Positive/ count of all events (3) Average prediction time The prediction time is defined in Section 3.2. The prediction time is the time span left for the autonomic system or system administrator to respond with the possible upcoming failures.

272

W. Zhou et al.

5.2 The Experimental Methodology Our online event correlation system includes two major components: Log agents and Log server. Log agents in each data node collect logs and transfer logs to Log server in almost real time. After Log server receives all the logs of a time frame, it will preprocess and filter the log stream, and mine the new event rules. In our experiment, we use an emulation methodology. Instead of log agents on each data node sending logs to the log server, we use a program on another node to replay the logs of the 260-nodes Hadoop cluster systems, and send the logs to Log server in real time according to the timestamp of each event in logs. 5.3 Online Log Analysis After log preprocessing and filtering, the event logs are used to analyze the failure correlations. When the 38432 logs between Oct 26, 2008 and Dec 31, 2008 are analyzed by our Apriori-SO algorithm, the support threshold and the confidence threshold are respectively set to 0.5 and 0.25 through comparing several runs of experiments with different configurations. The average analysis time and the analysis time of time frames are shown in Fig.6 with the varying time windows of the log buffer. We can see from Fig.6 that the average analysis time of time frames increases with the sliding time window of the log buffer, and the average analysis time decreases.

s i s y l a n a e g a r e v a

50 45 40 35 ) s 30 m ( 25 e m i t 20 15 10 5 0

average analysis time

10

20

average analysis time of time frame

30 45 60 90 Log buffer(min)

900 800 700 600 500 400 300 200 100 0

e mi ) t sm si (s sy em la ar n f a e eg mi ar t e fo va

120 180

Fig. 6. Average analysis time and average analysis time of time frames V.S. time window of Log buffer Tb (Sth=5, Cth=0.25)

The number of event rules is shown in Fig.7 with the varying time window of the log buffer. We can observe in Fig.7 that the number of event rules also increases with the size of log buffer. That is to say, the integrity of associate event rules is related to the increase of the set of log buffer.

Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems

se 600 lu r 500 tn 400 ev e 300 fo 200 re 100 bm uN 0

10

20

30

45

60

90

120

273

180

Log Buffer(min)

Fig. 7. Number of event rules V.S. Log buffer Tb (Sth=5, Cth=0.25)

5.4 Online Event Prediction Together with the online log analysis, online event prediction is used to predict upcoming events based on the event rules. Based on event rules generated with the Apriori-SO algorithm, the logs are used to online predict events. In our experiment, we online analyze the event logs between Oct, 2008 and Dec, 2008, and then use these event rules to online predict the event logs in Jan, 2009. When the support threshold and confidence threshold are respectively set to 5 and 0.25, and the prediction valid duration of online prediction is set as 60 minutes, the result is shown in Fig.8.

) % ( e t a r n o i s i c e r P

100 90 80 70 60 50 40 30 20 10 0

Precision rate(%)

Recall rate(%) 10

20

30

50 45 40 35 30 25 20 15 10 5 0

) % ( e t a r l l a c e R

45 60 90 120 180 240 Log Buffer(min)

Fig. 8. Precision rate and Recall rate of online prediction V.S. Log Buffer (Sth=5, Cth=0.25, Prediction valid duration=60 minutes)

When the log buffer of online log analysis is set as 60 minutes, the support threshold and confidence threshold are respectively set to 5 and 0.25. The result shows that the number of event rules increases from 355 to 386.

274

W. Zhou et al.

In Fig.9 and Fig.10, we compare our online prediction approach with our previous prediction approach based on the offline log analysis. Fig.9 presents the relationship between the precision rate and the prediction valid duration for both online and offline event prediction. Fig.10 presents the relationship between the recall rate and the prediction valid duration for both online and offline event prediction. 70 )60 % ( e50 t a r40 n o30 i s i c20 e r P10

Online Offline

0 10

20 30 40 50 60 Prediction valid duration(min)

Fig. 9. Precision rate of online and offline prediction V.S. Prediction valid duration (log buffer=60 minutes, Sth=5, Cth=0.25)

30 25 ) % (20 e t a r15 l l a10 c e R 5

Online Offline

0 10

20 30 40 50 60 Prediction valid duration(min)

Fig. 10. Recall rate of online and offline prediction V.S. Prediction valid duration (log buffer=60 minutes, Sth=5, Cth=0.25)

It can be seen from Fig.9 and Fig.10 that the precision rate and recall rate of online prediction are higher than offline prediction. It demonstrates that the online log analysis can effectively represent the dynamic change of event rules. The relationship between the average prediction time and prediction valid duration is shown in Fig.11. The difference between the average prediction time of online prediction and that of offline prediction are small.

Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems

)1000 s ( e 900 m i 800 t n 700 o i 600 t c 500 i d 400 e r p 300 e 200 g a r 100 e v 0 a

275

Online Offline

10 20 30 40 50 60 Prediction valid duration(min)

Fig. 11. Average prediction time of online and offline prediction V.S. Prediction valid duration (log buffer=60 minutes, Sth=5, Cth=0.25)

6 Conclusion In this paper, we present an online log analysis approach to analyze event logs of large-scale cluster systems. After online preprocessing and filtering, filtered logs are used to mine failure correlations and generate associate event rules online. We propose an improved Apriori associate rules algorithm named Apriori-SO to analyze the event logs. The Apriori-SO algorithm uses the sliding time window model and stores the log streams to log buffer, and analyze the log buffer to generate new rules and update the existed rules. Based on the observation that most of events rules occur in the same nodes or applications or have the same types, the Apriori-SO algorithm use an event filtering policy to reduce the computing complexity. The event rules generated by Apriori-SO algorithm can be used in online event prediction. The experiments on a production cluster system in the Research Institution of China Mobile show that our algorithms can achieve good precision rate in event prediction. In the near future, we will integrate path-based request tracing [22] and event correlation mining approaches to diagnosis the failure events and performance problems of Internet services applications. Much work will be done to analyze multidimensional event logs in large-scale cluster systems [23]. Acknowledgments. This paper is supported by the NSFC projects (Grant No. 60703020 and Grant No. 60933003).

References 1. Sahoo, R.K., Sivasubramaniam, A., Squillante, M.S.: Failure data analysis of a large-scale heterogeneous server environment. In: Proc. of DSN 2004 (2004) 2. Tierney, B., Johnston, W.: The NetLogger methodology for high performance distributed systems performance analysis. In: Proc. of HPDC (1998)

276

W. Zhou et al.

3. Sahoo, R.K., Oliner, A.J.: Critical Event Prediction for Proactive Management in Large scale Computer Clusters. In: Proc. of SIGKDD (2003) 4. Fu, S., Xu, C.: Exploring Event Correlation for Event prediction in Coalitions of Clusters. In: Proc. of ICS (2007) 5. Fu, S., Xu, C.: Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management. In: Proc. of SRDS (2007) 6. Gujrati, P., Li, Y., Lan, Z.: A Meta-Learning Failure Predictor for Blue Gene/L Systems. In: Proc. of ICPP (2007) 7. Knight, J.C.: An Introduction To Computing System Dependability. In: Proc. of ICSE (2004) 8. Tang, D., Iyer, R.K.: Analysis and Modeling of Correlated Failures in Multicomputer Systems. IEEE Trans. on Comput. 41(5), 567–577 (1992) 9. Koskinen, E., Jannotti, J.: BorderPatrol: Isolating Events for Precise Black-box Tracing. In: Proc. of Eurosys (2008) 10. Liang, Y., Zhang, Y.: BlueGene/L Failure Analysis and Prediction Models. In: Proc. of DSN (2006) 11. Hacker, T.J., Romero, F., Carothers, C.D.: An analysis of clustered failures on large supercomputing systems. Journal of Parallel and Distributed Computing 69(7), 652–665 (2009) 12. Oliner, A.J., Aiken, A., Stearley, J.: Alert Detection in Logs. In: Proc. of ICDM (2008) 13. Zhou, W., Zhan, J., Meng, D., Xu, D., Zhang, Z.: LogMaster: Mining Event Correlations in Logs of Large-scale Cluster Systems. In: CoRR abs/1003.0951 (2010) 14. Jiang, N., Gruenwald, L.: Research Issues in Data Stream Association Rule Mining. ACM SIGMOD Record 35(1) (March 2006) 15. http://en.wikipedia.org/wiki/Hadoop 16. Salfner, F., Tschirpke, S.: Error Log Processing for Accurate Event prediction. In: USENIX Workshop on The Analysis of System Logs, WASL (2008) 17. Lou, J.G., Fu, Q., Wang, Y., Li, J.: Mining Dependency in Distributed Systems through Unstructured Logs Analysis. In: USENIX Workshop on WASL (2009) 18. Zhang, R., Cope, E., Heusler, L., Cheng, F.: A Bayesian Network Approach to Modeling IT Service Availability using System Logs. In: USENIX Workshop on WASL 2009 (2009) 19. Tang, D., Iyer, R.K.: Analysis and Modeling of Correlated Failures in Multicomputer Systems. IEEE Trans. on Comput. 41(5), 567–577 (1992) 20. Oliner, A., Stearley, J.: What Supercomputers Say: A Study of Five System Logs. In: Proc. of DSN (2005) 21. Rouillard, J.P.: Real-time log file analysis using the Simple Event Correlator (SEC). In: Proc. of LISA (2004) 22. Zhang, Z., Zhan, J.: Precise request tracing and performance debugging of multi-tier services of black boxes. In: Proc. of DSN 2009 (2009) 23. Zhou, W., Zhan, J.: Multidimensional Analysis of System Logs in Large-scale Cluster Systems. In: Proc. of DSN 2008, Fast Abstract (2008)

Differentiated Replication Strategy in Data Centers Tung Nguyen, Anthony Cutway, and Weisong Shi Wayne State University {nttung,acutway,weisong}@wayne.edu

Abstract. Cloud computing has attracted a great deal of attention in both academia and industry. We envision the provisioning of differentiated services as being one of the key components to the success of cloud computing. Unfortunately, this issue has not been fully addressed in the past. Realizing that different users might have different requirements regarding availability, reliability, durability, response time and so on, we conjecture that providing flexible replication mechanism is the right approach to service differentiation. In this paper, we propose Differentiated Replication (DiR), which allows users to choose different replication strategies by considering both the user requirements and system capability. We implemented a system that offers four differentiated storage services with DiR. The experimental results show that this service actually provides different availabilities and execution times for different service types with the same request traces, failure traces, and workload. In addition, we also show that in comparison to the regular uniform replication, DiR can further improve resource utilization, which will in turn provide a better user experience with regards to cloud computing.

1 Introduction Recently, cloud computing has been a hot research topic. In fact, it may shape the future of the computer industry [1–3]. With cloud computing, companies can reduce their overheads regarding the buying, installing, and maintaining computer resources. With cloud computing, they can register necessary services from the Internet, allowing them to focus on the core aspects of their business. It is clear that many users of cloud computing services have different requirements for the service. Some require optimal performance, while others seek data redundancy and reliability. In general, users may demand different properties from services, such as availability, reliability, durability, and performance. One, several, or all of these properties may be under consideration for any given request. This possibility demands differential services. Most available cloud services today do not take this fact into account. Often times, they only provide one type of service for all of their users. For example, Windows Azure and Amazon S3 maintain a fix number of replicas (3) for data stored in it. Both Microsoft Windows Azure and Amazon S3 guarantee in their service level agreement (SLA) that the availability of customer’s data is always greater than 99.9%. This may lead to poor resource utilization from the provider or an inefficient usage from the user. We also find that all of these properties relate to replication. Differentiated replication strategies can provide different availabilities, reliability, durabilities, and performances. Therefore, we propose a new strategy called Differentiated Replication (DiR) to address this problem. C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 277–288, 2010. c IFIP International Federation for Information Processing 2010

278

T. Nguyen, A. Cutway, and W. Shi

To demonstrate the concept of DiR, we built a prototype system providing storage services capable of providing data at different rates of availabilit. Users are provided a simple interface that allows them to store and later fetch their data with their expectation. The current version of DiR provides four replication types. This paper comes in three core parts . First, we propose the idea of differentiated services for data centers and design a set of simple but powerful APIs for high level users. Second, we propose four different replication strategies on the server side, enabling differentiated services in terms of data availability. Third, we implement a prototype of DiR and evaluate the four proposed replication strategies in a comprehensive manner in terms of availability using both synthetic and real failures traces. The evaluation results show that the last replication strategy, which takes both user requirements and system behavior into consideration, is capable of providing different availabilities and execution times. The rest of the paper is organized as follows. Section 2 exhibits the design of DiR, Section 3 describes the implementation, and Section 4 presents the experiment and results. The related work and conclusion are covered in Sections 5 and 6 respectively.

2 System Design 2.1 Assumptions and Requirements The goal of DiR is to build a read/write only (not modified) storage system that provides different types of replication services to the user with better resource utilization. We target read/write only storage system for the sake of simplicity in term of data consistency. However, this is also practical because the data in the Cloud is often very huge and should not be modified. This assumption is often made in the area of data intensive computing like HDFS of Hadoop [4]. We mainly focus on differences in: (1) replication strategies; (2) search algorithms; (3) network topology; and (4) availability. Hence, in our case better service may mean higher availability or faster or less communication cost. The idea for a user to request other properties such as durability, reliability, and performance is almost similar and will be introduced later. The system is heterogeneous with different hardware, software, computational ability, etc. Each member can join or leave the system or fail at any time. Given the requirement of read/write only service, the system does not need to maintain consistency between replicas. Therefore, the only two methods we need to provide are store and fetch. In addition, the failure we consider is of a fail-stop type rather than the Byzantine failure type. This means that when the machines are alive, they are supposed to have correct behavior. Finally, reasonable load balance, fault-tolerance, scalability and reliability are also important requirements of the system. 2.2 APIs Users of our system are not terminal application users but developers of front-end applications. They are not supposed to know replication techniques in detail. The interface component is required to be simple enough so that it can easily be used in applications.

Differentiated Replication Strategy in Data Centers

279

Table 1. The DiR APIs Function name fetch(filename) fetch(filename,service-type) store(filename) store(filename, service-type)

Description retrieve a file from the DiR system retrieve a file from the DiR system with specific service type insert a file from local file system to DiR insert a file from local file system to DiR with specified service type

Therefore, we only need to add one more parameter, called service-type, to the current APIs of Chord/DHash [5] to indicates which type of service the user requires. These methods merely call new methods with a default service-type, as illustrated in Table 1. The user will notice that the service-type of a fetch needs to match that of the store for a certain file. 2.3 Availability Analysis Basically, there are two ways to provide different levels of availability: change the number of replicas, or change the location of them. Higher availability of an object can be achieved by increasing the number of its replicas or by placing its replicas onto more “available” machines. Intuitively, more replicas on more reliable nodes will produce higher availability. Even so, we cannot tell which method provides better availability in some situations. The system has to decide the availability under the resource constraints to guarantee the load balance. If we do not handle this correctly, we may introduce extra overhead to highly available nodes. The problem is formalized as follows. Given an expected availability A of a certain object/file, and a set of nodes with their own availability, we need to find the number of replicas, and the specific nodes in which to store them. It is noteworthy that the availability of an object stored on a machine is equal to the availability of that machine, also under the fail-stop assumption. Let M be a set of nodeIDs and corresponding availabilities of N nodes. M = {(ni , ai )|ni is the ID of node i and 1 ≤ i ≤ N } Let

⎧ ⎨

⎫ (nxl , axl ) ∈ M, ⎬ σ = {(nxl , axl )}l=1,k nxi = nxj , 1 ≤ i, j ≤ k, ⎩ ⎭ 1 ≤ xi , xj , k ≤ N

The solution of the problem is in σ set. Assuming {(ny1 , ay1 ), (ny2 , ay2 ), . . . , (nyl , ayl )} is one specific solution, the following approximation should be satisfied A ≈ 1 − (1 − ay1 )(1 − ay2 ) . . . (1 − ayl ) = f (ay1 , ay2 , · · · , ayl ).

(1)

280

T. Nguyen, A. Cutway, and W. Shi

Note that in Equation (1) while A represents the user’s expectation, {ay1 , ay2 , · · · , ayl } represents the availability of the system. With a certain value of A, we may have several solutions. One way to completely solve this is: For every member mi of σ, compute A = f (mi ), if A ≈ A then mi is a solution. Unfortunately, this method is O(2N ), in which N is the number of nodes. The second method is to calculate the average of all the availabilities a=

N

ai

i=1

and use the formular in [6]: l=

log(1 − A) log(1 − a)

(2)

to derive the number of replicas l. This approach may create a resource utilization problem in the next step of choosing proper nodes to store replicas. The third method is shown in the following algorithm 1: 2: 3: 4: 5: 6: 7: 8:

Sort M in decending order of availability F ⇐ (1 − M [0].a) i⇐1 while (i < N )and(A < A) do F ⇐ F ∗ (1 − M [i].a) A ⇐ 1 − F inc(i) end while

This method may cause an overload in the high availability nodes. Finally, since this problem is of a constraint programming type, another regular way to solve it is to use an existing C(L)P solver. As a result, no matter what method we use, from the system design point of view, the system is required to have a monitor server to provide the availabilities of all nodes in the system. This leads to the need for OPERA. 2.4 OPERA OPERA stands for OPEn Reputation Architecture, which is a general framework to compute the reputation of nodes in the system. OPERA allows users to define how to calculate reputation, and it returns the reputation of nodes based on that definition. It employs a traditional master-slave model in its communication, since we need to obtain the global reputation of the system. OPERA clients communicate to each other in replying to the rate request from the server. Design and implementation of OPERA are not detailed here due to page limitation. Basically, OPERA employs Ganglia (a monitoring tool) to collect information about nodes in the system and calculates reputation for each node based on this information.

Differentiated Replication Strategy in Data Centers

281

2.5 Utilization Analysis We argued that DiR also provides better resource utilization. This is rather obvious and straightforward. Let’s define the utilization of a system =

Resreal Resneed

in which, Resreal are the resources that really used to provide the services and Resneed are the resources that can satisfy user needs. The following analysis compares the resource utilization of DiR and that of uniform replication, which is a very widely used technique today. We assume both systems have n requests r1 , r2 , · · · , rn and ci is the correspondent number of replicas

nto satisfy r

i . The total number of replicas of DiR (idealy) and uniform method is i=1 ci and n M ax{c |0 ≤ i ≤ n} respectively. Note that, in the uniform repication system, i i=1 we need to choose l large enough to satisfy the highest quality requests. For example, with n = 4, c1 = 2, c2 = 3, c3 = 2, c4 = 4, the resource utilization of ideal DiR 4×4 and uniform system is 1 and 2+3+2+4 = 1.45. As a result, this analysis proved that DiR used resources more efficiently. From another aspect, with the same resource, DiR (better utilized system) can satisfy more requests as well. 2.6 DiR System Architecture The overall system architecture is shown in Figure 1. The user uses the DiR Interface to ask for service. Depending on the request, the DiR interface decides which replication strategy to use. There are, in total, four replication strategies available (represented by four blocks in the figure) that can offer all required differences. In fact, there are many other options to choose to construct a strategy. For example, we can choose random walk search algorithm [7], CAN [8] or Pastry [9] to build a new type of service. Such openness is expressed by the lowest block with “three dots” in the figure. Finally, the rightmost circle with small squares inside represents the physical underlying network connecting the machines of the system.

DiR Interface

Regular Uniform Uniform + DHT Uniform + DHT + OPERA

Underlying Network

Non-uniform + DHT + OPERA ...

Fig. 1. The DiR system architecture

Our system can also support the differences in the durability/reliability of objects stored in it by modifying the policy being used in the monitoring server, OPERA, to define how to calculate them. The nodes that are more durable/reliable have higher reputation scores. By doing this, the OPERA server returns the durability/reliability of all nodes in the system. The remaining problem is how to calculate these values.

282

T. Nguyen, A. Cutway, and W. Shi Table 2. The summary of available service types Service type Description Sd Regular uniform replication, unstructured network Sd Uniform replication, DHT, ring-based Sd+o Uniform replication, DHT, ring-based, OPERA Sd+o+a Non-uniform replication, DHT, ring-based, OPERA

3 Implementation We implemented our system by extending Chord/DHash [5]. The architecture of an individual node (peer) derived from our previous design is shown in Figure 2. The rightmost block is the OPERA client. This block is in charge of rating other nodes and responding to the request from them as well as from the OPERA server. The highest level in Figure 2 is the DiR interface that offers the APIs to the user and chooses an appropriate handler. The two leftmost blocks (File Transfer and breadth-first search) correspond to the “Regular Uniform” handler of Figure 1. “File Transfer” is used to receive files sent from other nodes. ”Breadth First Search” is used to find a replica. The last three middle blocks (DiR Manager, DHash and Chord) correspond to the final three handlers in Figure 1. The lowest block, Chord, is a replica lookup service. The upper block that uses the Chord lookup service is DHash, a block store service. This layer is responsible for storing/retrieving blocks of data into/out of storage devices. The DiR Manager is located on top of the block store layer (DHash). It is used to provide the store/fetch file functions to the DiR interface, communicate to OPERA and calculate the appropriate number of replicas as well as their locations.

DiR Manager File Transfer

Breadth First Search

DHash

OPERA Client

DiR Interface

Chord

Fig. 2. An augmented node in DiR implementation

We have implemented a prototype of a DiR storage system and OPERA using C++ on Linux. The DiR prototype has implemented all parts of the previous design. It offers several types of differences: in the network topology (unstructured or structured), communication model (message and aRPC), search algorithm, replica location and expected availability. These differences are implied in the following four service types (corresponding to the four handlers in the Section 2.6) and are summarized in Table 2. Type Sr indicates that a user wants to use a normal uniform replication with a traditional search algorithm. Nodes communicate with each other by sending and receiving

Differentiated Replication Strategy in Data Centers

283

messages. Although this strategy is popular, we built it from scratch since we could not find any open source implementation available online. To handle this type, each node has two servers. The first one, (called searchserver), waits to receive a file request and looks for the file in the system. This server is the “Breadth First Search” block in Figure 2. To make the search process workable, we apply the TTL (Time To Live) technique to each request. The second server, (called filetransferserver), is used to receive actual files sent from searchservers. It is worth noting that since we use failure traces in the experiment, these servers have to be implemented to tolerate failure at anytime. Therefore, we also apply a timeout technique in the communication. Type Sd represents the uniform replication, using Chord as a DHT lookup algorithm. In fact, this type is the original version of Chord/DHash. Type Sd+o utilizes the same lookup algorithm as Sr but with a different replica location. With this service type, DiR first calls Chord to get the successor nodes of the hash value of the filename and contacts OPERA to ask for their availabilities. Based on this information and the predefined number of replicas, DiR chooses nodes with the highest availability to host the replicas of that file. The thing worth being noticed here is that in this prototype, these first three types have the same number of replicas. Type Sd+o+u means the user only cares about the availability, not the number of replicas. This version of DiR fixes the target availability to 3 “nines” as in Amazon S3 storage service, but this can easily be changed to the desired values. DiR calculates the necessary replicas using equation (2) with the average availability of the successor list of the file ID. In the step of choosing host machines to store the replica, we simply choose the sets randomly, recalculate the availability accordingly of each set, and choose the one that is close to the expected availability. Although this way may not provide the best solution, it is faster and helps to balance the load.

4 Experiment and Results To evaluate our system, we deployed DiR onto a cluster of 21 nodes. The first 20 nodes had DiR installed. The 21 node was dedicated to the OPERA server with a guarantee not to fail during the experiment. Syntheticfailuretrace

Realfailuretrace

Numberoffailures

30 25 20 15 10 5 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 NodeID

Fig. 3. Number of failures

284

T. Nguyen, A. Cutway, and W. Shi

To prepare for the experiment, first, we need to have a files generator and distributor to create and distribute files randomly to 20 nodes. The total number of files in the synthetic trace is 400, and that of the DZero trace [10] is 25,951 files. Second, for request and failure traces, we used both synthetic traces and modified real traces. Figure 3 shows the detailed number of failures in both synthetic and real failure trace. The total failure duration time of each node is displayed in Figure 4. The simulation time for the synthetic traces is 50 minutes. The real failure trace is from the availability information of the first 20 nodes of Microsoft PCs trace [11] measured during 35 days since July 6, 1999; and the real request trace is from a physical application of the DZero experiment [10] on April 2004. Since it was not practical to conduct the experiment for an entire month, both real failure and request traces are scaled down to one day only. In scaling down the request trace, we encountered the problem of congestion. This is because the request trace was forced to request too many files at the same time. As a result, we modified the trace so that the time to ask for the file is slightly different if they are the same in the original trace. The availability results returning from the OPERA server are shown in Figure 5. 80000

1200

70000

1000

60000 800

50000 40000

600

30000

400

20000 200

10000 0

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Realfailuretrace

Syntheticfailuretrace

Fig. 4. Failure duration time in the failure trace

Syntheticfailuretrace

Realfailuretrace

1.2

Availability

1 0.8 0.6 0.4 0.2 0 1

2

3

4

5

6

7

8

9

10 11 12 13 14 15 16 17 18 19 20 NodeID

Fig. 5. The availability of the system

In this section, we measure two main metrics: execution time and availability of different service types. Table 3 shows the availability of the system using the synthetic traces and the modified real traces. From the synthetic results, we can see significant improvement in the availability of the Sr with different configurations of the neighbor list (two or three neighbors). In addition, we also found that the availability of Sd+o is better than that

Differentiated Replication Strategy in Data Centers

285

Table 3. The availability of DiR under synthetic and real traces Service-type Sr (2) Sr (3) Sd Sd+o Sd+o+u Availability (synthetic trace) 44.62% 86.40% 61.62% 70.06% 98.13% Availability (real trace) 28.02% 29.32% 82.17% 89.79% 99.95%

of Sd . This means we can improve the availability of Chord/DHash with the aid of OPERA. The table also shows that we cannot tell which type is “better” generally. One can argue that Sr (3) (with 3 neighbors) is the best. However, this statement is true for the availability only. It is easy to see that, the Sr (3) strategy costs more resources (in term of link number and bandwidth) than the others. From the real results, the availability of Sd+o is also better than that of Sd . The availabilities of Sr are poor because the real request trace requests several hundred files at the same time; and together with the failure trace, it crashed some of our search servers and hence, produced those poor availabilities. Another experiment was about Sd+o+u alone. Figure 6 shows the availability of 10 different files of random size. The horizontal line in the figure represents the expected availability that was set to three nines by default. The result was measured in 50 minutes.

Availability

Availability(serviceͲtype3)

ExpectedAvailability

1.002 1 0.998 0.996 0.994 0.992 0.99 0.988 0.986 0.984 1

2

3

4

5

6

7

8

9

10

FileIDs

Fig. 6. Availability of files using Sd+o+u

700

StoreSd

StoreSd+o

StoreSd+o+u

600

FetchSd

FetchSd+o

FetchSd+o+u

Time(ms)

500 400 300 200 100 0 4

8

16

32

64

128

256

512

1024

Filesize(KB)

Fig. 7. DiR performance of Sd , Sd+o and Sd+o+u

286

T. Nguyen, A. Cutway, and W. Shi

To measure the performance of DiR, we created files with various sizes, inserted them into the system and then retrieved them. We only measured the response time of the successful requests. Assuming that there was no failure, we got the results shown in Figure 7.

5 Related Work Differential service is one of the key aspects of DiR, encompassing flexible availability, reliability, durability, file placement, and search methods. Many studies are related to and have led to the culmination of DiR. Beside Chord/DHash [5], perhaps the closest work to ours is the Total Recall [6]. Total Recall file system of Bhagwan et al. also provides an option to choose the availability of objects. However, their users are in fact system administrators. They aimed at relieving administrators’ burden by maintaining the degree of replicas automatically. Another paper of Zhong et al. [12], which also employed the non-uniform replication, considers the optimal number of replicas of an object for high availability to be directly proportional to the object’s popularity. However, their approach does not consider the different type of services from users. Also related to replication strategy, Cohen [13] finds the optimal replication strategy, in term of search size, lies between a uniform and a proportional strategy. Different placement algorithms were introduced in [14–17]. To increase availability, Giwon’s work in [18] is concerned with dynamically replicating objects, and Acunam’s group uses a fixed number of replicas depending on peer availability [19]. In regard to durability, Wang et al. focuses on the durability of data through replication [20], while Chun’s research improves the durability of large amounts of data using a replication algorithm [21]. Besides the breadth first search and the Chord mentioned in the previous sections, [7] presented a search algorithm using multiple random walks to improve performance over the Gnutella like flooding search method.

6 Conclusions and Future Work Diversity in the requirement of services will soon be an important feature and requirement in cloud computing. In this work, we took the first step to address this problem. Focusing on the replication technique, we proposed the concept of differentiated replication that can offer different types of services, and developed a prototype storage system focusing on the availability. Future work in this project will involve improving OPERA so it will be capable of monitoring other metrics, such as performance, durability, and so on. In addition, we will investigate the techniques to maintaining multiple service types in the dynamic environments, e.g., large-scale data centers [1], which have many different applications simultaneously.

Acknowledgments We would like to thank the anonymous reviewers for their comments and Cole Brown at Rochester for his editing changes on this manuscript.

Differentiated Replication Strategy in Data Centers

287

References 1. Bryant, R.E.: Data-intensive supercomputing: The case for disc. Technical Report CMU-CS07-128, School of Computer Science, Carnegie Mellon University (May 2007) 2. Armbrust, M., Fox, A., Griffith, R., Joseph, A., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., et al.: Above the clouds: A berkeley view of cloud computing. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-28 (2009) 3. Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., Zaharia, M.: A view of cloud computing. ACM Commun. 53(4), 50–58 (2010) 4. http://wiki.apache.org/hadoop/ 5. Stoica, I., Morris, R., Karger, D., Kaashoek, M.F., Balakrishnan, H.: Chord: A scalable peerto-peer lookup service for internet applications. In: ACM SIGCOMM 2001 (2001) 6. Bhagwan, R., Tati, K., Cheng, Y.-C., Savage, S., Voelker, G.M.: Total recall: System support for automated availability management. In: Proc. of NSDI 2004, pp. 25–25. USENIX Association, Berkeley (2004) 7. Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S.: Search and replication in unstructured peerto-peer networks. In: ICS 2002: Proceedings of the 16th International conference on Supercomputing, pp. 84–95. ACM, New York (2002) 8. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Schenker, S.: A scalable content addressable network. In: Proc. of ACM SIGCOMM 2001 (2001) 9. Rowstron, A., Druschel, P.: Pastry: Scalable, distributed object location and routing for large scale peer-to-peer systems. In: IFIP/ACM Middleware 2001 (2001) 10. Doraimani, S.: Filecules: A New Granularity for Resource Management in Grids. Master thesis in Computer Science, University of South Florida (2007) 11. Bolosky, W.J., Douceur, J.R., Ely, D., Theimer, M.: Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. In: Proc. SIGMETRICS (2000) 12. Zhong, M., Shen, K., Seiferas, J.: Replication degree customization for high availability. SIGOPS Oper. Syst. Rev. 42(4), 55–68 (2008) 13. Cohen, E., Shenker, S.: Replication strategies in unstructured peer-to-peer networks. SIGCOMM Comput. Commun. Rev. 32(4), 177–190 (2002) 14. Chervenak, A.L., Schuler, R.: A data placement service for petascale applications. In: PDSW 2007: Proceedings of the 2nd International Workshop on Petascale Data Storage, pp. 63–68. ACM, New York (2007) 15. Chervenak, A.L., Schuler, R., Ripeanu, M., Amer, M.A., Bharathi, S., Foster, I., Iamnitchi, A., Kesselman, C.: The globus replica location service: Design and experience. IEEE Transactions on Parallel and Distributed Systems 99(1) (2008) 16. Douceur, J., Wattenhofer, R.: Optimizing file availability in a secure serverless distributed file system. In: Proceedings 20th IEEE Symposium on Reliable Distributed Systems, pp. 4–13 (2001) 17. Lian, Q., Chen, W., Zhang, Z.: On the impact of replica placement to the reliability of distributed brick storage systems. In: ICDCS 2005: Proceedings of the 25th IEEE International Conference on Distributed Computing Systems, Washington, DC, USA, pp. 187–196. IEEE Computer Society, Los Alamitos (2005) 18. On, G., Schmitt, J., Steinmetz, R.: The effectiveness of realistic replication strategies on quality of availability for peer-to-peer systems. In: P2P 2003: Proceedings of the 3rd International Conference on Peer-to-Peer Computing, Washington, DC, USA, p. 57. IEEE Computer Society, Los Alamitos (2003)

288

T. Nguyen, A. Cutway, and W. Shi

19. FM, C.A., Martin, R., Nguyen, T.: Autonomous replication for high availability in unstructured p2p systems. In: Proceedings 22nd International Symposium on Reliable Distributed Systems, pp. 99–108 (October 2003) 20. Wang, A.I.A., Reiher, P., Kuenning, G.: Introducing permuted states for analyzing conflict rates in optimistic replication. In: SIGMETRICS 2005: Proceedings of the 2005 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 376–377. ACM, New York (2005) 21. Chun, B.-G., Dabek, F., Haeberlen, A., Sit, E., Weatherspoon, H., Kaashoek, M.F., Kubiatowicz, J., Morris, R.: Efficient replica maintenance for distributed storage systems. In: NSDI 2006: Proceedings of the 3rd conference on Networked Systems Design & Implementation, p. 4. USENIX Association, Berkeley (2006)

Eﬃcient Pipelining Parallel Methods for Image Compositing in Sort-Last Rendering Wei Fang1,2 , Guangzhong Sun1 , Peng Zheng2 , Tiening He2 , and Guoliang Chen1 1

2

School of Computer Science and Technology, University of Science and Technology of China, Hefei, China [email protected], {gzsun,glchen}@ustc.edu.cn Institute of Computer Application, China Academy of Engineering Physics, Mianyang, China {eliza zheng,htn2005}@126.com

Abstract. It is well known that image compositing is the bottleneck in Sort-Last rendering. Many methods have been developed to reduce the compositing time. In this paper, we present a series of pipeline methods for image compositing. Our new pipeline methods based on Direct Send and Binary Swap. However, unlike these methods, our methods overlap the rendering time of diﬀerent frames to achieve high f ps(frames per second) in ﬁnal display. We analyze the theoretical performance of our methods and take intensive experiments using real data. The results show that our new methods are able to achieve interactive frame rates and scale well with both the size of nodes and screen resolution. Keywords: Pipelining, parallel methods, image compositing.

1

Introduction

Polygon rendering and volume rendering are important in scientiﬁc visualization. As the size of data produced by large-scale computation and simulation grow larger and larger, a natural solution for eﬀective visualization of very large dataset is utilizing a supercomputer or a PC-cluster to parallelism rendering work. Molnar et al. [1] described a framework for parallel rendering. They proposed three types of parallel rendering algorithms: sort ﬁrst, sort-middle and sort-last, depending on where the sort and redistribution of data occurs. Sort-last algorithm is widely used in parallel rendering because of its scalability and simple task decomposition for achieving load balance. However, the image compositing stage in sort-last could be very expensive because of large amount of messages exchanged. Therefore, image compositing could become a bottleneck that aﬀects the eﬃciency of the sort-last parallel rendering. A lot of methods have been developed to composite image for sort-last rendering. Direct Send [2], Binary Tree [3], Binary Swap [4], and parallel pipeline [5] [6] are already been proved to be eﬃcient ways for image compositing. But all the methods mentioned above have some disadvantages in communication. In this C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 289–298, 2010. c IFIP International Federation for Information Processing 2010

290

W. Fang et al.

paper, we construct a tree-based pipeline system, using some additional nodes named transfer node to transfer/composite pixels, and implement Direct Send, Binary Swap, and Hybrid Partition methods on this pipeline system. Intensive experiments have been done to test these algorithms and we found that our new methods can greatly reduce the number of pixels communicated among processors and increase the frame rates of ﬁnal display, especially in low bandwidth, high latency networking systems. This paper is organized as follows: related works are introduced in section 2. We give some theoretical analysis for Direct Send and Binary Swap methods in section 3. Our new pipeline algorithms are described in detail in section 4. Section 5 is dedicated to implementation details and experimental results. Finally, concluding remarks are discussed in section 6.

2

Related Works

Image compositing is the key stage of sort-last rendering method. Most performance lost happens in this stage. So a number of parallel image compositing method have been implemented, both in software and hardware. Software-based methods for image compositing have been proposed in the literature, and can be applied either to polygon or volume rendering. The simplest way is sending pixels directly to the node who are responsible for blending them. This is called Direct Send [2]. The disadvantage of this method is that many nodes would send pixels to the same node at the same time, which would cause link contention and hurt the performance greatly. Ma [4] proposed Binary Swap method, which in most cases is better than Direct Send. Many work have been done to improve Direct Send and Binary Swap method, such as scheduled linear image compositing [8], multiple bounding rectangle [5], run-length encoding [9], interleaved splitting [9], 2-3 swap [10] and so on. Garcia et.al [7] devised a hybrid image- and object- space partitioning algorithm to perform parallel rendering. They divide the rendering nodes into several groups, and perform object space partitioning among these groups. Within each group, they perform image space partitioning. This strategy can eﬃciently decrease the pixels communicated among nodes. There are also some pipeline work to reduce the time of image compositing [5] [6]. All of them implement pipeline in the process of rendering one frame. Unlike these methods, our new pipeline algorithm implement pipeline in the process of rendering several frames. Contiguous frames ﬂow in the pipeline at the same time. The way we used to improve the f ps in ﬁnal display is overlapping the time of compositing diﬀerent frames, not reducing the time of compositing one frame. This is the major diﬀerence between our methods and the other pipeline methods. We will describe it in detail in section 4.

3

Theoretical Analysis

We give some theoretical analysis for Direct Send and Binary Swap methods, which are the most representative and commonly used.

Eﬃcient Pipelining Methods for Image Compositing in Sort-Last Rendering

291

In this paper, we assume that every frame should be displayed in display node. And for the current generation of network device, the network usually supports full-duplex send and receive operations. So, sending and receiving can be overlapped. Some notations used in this paper list below. – n : the number of rendering nodes. – T (M ) : the reciprocal of f ps, using method M. – Trendering : the time of rendering one image and reading back color buﬀer(and depth buﬀer in polygon rendering) from GPU by rendering nodes. – Tcomm : the communication time between rendering nodes. – Tblending : the time of blending pixels. – Pxy : the size of pixels need to be transferred, where x is the screen width and y is the screen height. – Tdisplay : the time of displaying Pxy pixels in the display node. – L : the sum start-up time of a communication channel and latency of sending and receiving. (Assume sending and receiving have the same latency) – Tc : the data transmission time per byte. To simplify our model, we assume Tc is constant between one node and any other node. – Tb : the blending time per byte. Usually Tb is much smaller than Tc because blending two images is much faster than transfer one image, and Tb can also be speeded up by sse instructions provided by CPU. 3.1

Direct Send

The Direct Send method is simple: the screen is divided into n fractions, and every rendering node deal with one of them. In compositing stage, each node sends pixels direct to the node who is responsible for blending that portion. At the end of compositing stage, every rendering node has n1 part of ﬁnal image. Let Trn be the time of rendering nodes generate one frame, and let Tdn be the time of display node display one frame. Obviously, there are some overlap time between Trn and Tdn : when rendering nodes sending pixels to display node, display node is receiving these pixels. Let Trn dn comm be this overlap time. Thus T (DS) = Trn + Tdn − Trn

dn comm

(1)

where Trn = Trendering + Tcomm + Tblending + Trn dn comm (2) 1 n−1 1 Pxy Tb + Pxy Tc + L (3) = Trendering + (n − 1)( Pxy Tc + L) + n n n and Tdn = Trn

dn comm

+ Tdisplay

(4)

here Tc = Tc , because of link contention. In this condition, we assume every node has the same transferring rate. Thus Tc ≈ nTc , ﬁnally, T (DS) ≈ Trendering + Tdisplay +

2n − 1 n−1 Pxy Tc + Pxy Tb + nL n n

(5)

292

W. Fang et al.

3.2

Binary Swap

In Binary Swap algorithm, compositing one frame needs log2 n stages. In the ith stage, every rendering node needs to send and receive 21i Pxy pixels, and blend 1 2i Pxy pixels. Thus Trn = Tredering + Tcomm + Tblending + Trn

dn comm

(6)

= Trendering + Trn

dn comm

+

Pxy Tc Pxy Tb ( s + + L) 2 2s s=1

(7)

≈ Trendering + Trn

dn comm

+

n−1 Pxy (Tc + Tb ) + L log2 n n

(8)

log2 n

Trn

dn comm

and Tdn are the same as in Direct Send method, so

T (BS) = Trn + Tdn − Trn

(9) 2n − 1 n−1 + Pxy Tc + Pxy Tb + L log2 n (10) n n

dn comm

≈ Trendering + Tdisplay

We can see that Direct Send and Binary Swap have the same complexity of communication. But reported by many researches, Binary swap has better runtime performance. Two reasons can explain this, the one is that Direct Send method needs globe network operations, which would become a disaster when the number of processors increase; the other is that Binary Swap has less network function calls.

4

Pipeline Methods

The key idea of our methods is employing additional nodes(named transfer node) to construct tree-based pipeline, which could reduce the number of pixels need to be transferred among nodes. We try to totally overlap the rendering time, blending time, and the time of sending/receiving pixels. Unlike Direct Send and Binary Swap method, our methods is in an eﬀort to improve the f ps in ﬁnal display, not reduce the time of render one image. In this section we will propose three new algorithms: Pipeline based on Direct Send(PDS), Pipeline based on Binary Swap(PBS) and Hybrid Partition Pipeline(HPP). Notice that most improvement for Direct Send and Binary Swap, such as compression, interleaving and run-level encoding, can be applied in our methods. We don’t use them in this paper in order to face the worst case situation. 4.1

Pipeline Based on Direct Send

The simplest method using transfer node to implement pipeline is displayed in Figure 1. We use m additional nodes as transfer nodes. The screen is divided into m fractions. Each transfer node takes charge of one fraction. Every rendering node communicates with all the transfer nodes, sending one fraction to the

Eﬃcient Pipelining Methods for Image Compositing in Sort-Last Rendering

293

Fig. 1. Pipeline based on Direct Send, m=2, n=4

transfer node who is responsible for that portion. Transfer nodes receive pixels, then blend them, then send them to the display node. While rendering the f th frame, the rendering node is sending the (f − 1)th frame to the transfer node. At the same time, the transfer node is receiving the (f − 1)th frame from all the rendering nodes, blending the (f − 2)th frame, and sending the (f − 3)th frame to display node. PDS algorithm overlap the rendering time, blending time and communication n Pxy time. In every step, rendering node sends Pxy pixels, transfer node receives m 1 pixels and sends m Pxy pixels, display processor receives Pxy pixels. So T (P DS) = max{Trn, Ttn , Tdn }

(11)

In rendering node, we overlap rendering time and communication time. Trn = max{ Trendering , Pxy Tc + (n − 1)L} transfer node needs to blends

1 m Pxy

(12)

pixels (n − 1) times, thus

n−1 1 n Pxy Tb , Pxy Tc + L, Pxy Tc + nL} m m m n−1 n ≈ max{ Pxy Tb , Pxy Tc + L, Pxy Tc + nL} m m

Ttn = max{

(13) (14)

Here Tc ≈ mTc as we discussed in equation (5), and Tdn = max{Tdisplay , Pxy Tc + mL}

(15)

Usually Tb Tc , nL Pxy Tc , Tdisplay < Trendering . Thus T (P DS) = max{Trendering ,

n Pxy Tc + nL, Pxy Tc + max{n, m}L} m

(16)

Compared with equation (5) and (10), equation (16) is a better result. When Trendering < Pxy Tc and m ≥ n, T (P DS) is bounded by Pxy Tc + mL. This is the upper bound of the model. But we could hardly get this performance using this method in real world, because the link contention would be very heavy.

294

4.2

W. Fang et al.

Pipeline Based on Binary Swap

Now we use n − 2 additional transfer nodes to build a binary tree to implement pipeline. In this hierarchical connection architecture, all the leaves are rendering nodes. We pair up these rendering nodes, each takes charge of rendering diﬀerent half of screen. Every rendering node ﬁrst swaps half of screen pixels with its partner, then blend them and send pixels of its partition to transfer node, who is responsible for compositing that portion. Every transfer node receives pixels from two children, then blend them, then send them to its parent(another transfer node or display node). Figure 2 shows at a certain moment this algorithm behaves.

Fig. 2. Pipeline based on Binary Swap, n=4

Every rendering node need to send and receive 12 Pxy pixels to/from its partner, and blend 12 Pxy pixels. It also needs to send 12 Pxy pixels to transfer node. Every transfer node needs to receive Pxy pixels, blend 12 Pxy pixels and send 12 Pxy pixels to its parent. Let Trn tn comm be the communication time between rendering node and transfer node, then Trn = max{Trendering , Trn tn comm , Tcomm + Tblending } 1 1 1 = max{Trendering , Pxy Tc + L, Pxy Tc + L + Pxy Tb } 2 2 2

(17) (18)

Here Tc = Tc , because the sending operation happens at the same time in each pair. We assume Tc ≈ 2Tc , then Trn ≈ max{Trendering , Pxy Tc + L + 12 Pxy Tb }

(19)

Ttn = max{Pxy Tc + L, 12 Pxy Tb , 12 Pxy Tc + L} = max{Pxy Tc + L, 12 Pxy Tb }(20) Tdn is the same as in equal(15). Thus 1 T (P BS) = max{Trendering , Pxy Tc + L + Pxy Tb } (21) 2 Using approx n transfer nodes in PDS algorithm, equation (16) shows a better performance than equation (21). But consider that PBS method has no link contention and Tb Tc , this algorithm should have better performance than PDS method in real world. The disadvantage of this method is that there is about log2 n frames latency in ﬁnal display.

Eﬃcient Pipelining Methods for Image Compositing in Sort-Last Rendering

4.3

295

Hybrid Partition Pipeline

In this section we present hybrid partition pipeline method which employs the idea from hybrid image- and object- space partitioning method [7]. We use kn rendering nodes to render images. There are another k(n − 2) nodes to be transfer nodes. Rendering nodes are divided into n groups, and each group has k rendering nodes. Datasets are also divided into n groups, and every node in the same group has the same data. In every group, we divide the screen into k partitions, and each node takes charge of diﬀerent part. We pair up the groups. And a rendering node’s partner is the node who draws the same screen partition in the paired group. Thus each rendering node only needs to swap 1 2k Pxy pixels with its partner. Easy to see that, using this method the number of pixels communicated among nodes will be greatly reduced. Figure 3 shows a certain moment this algorithm behaves, where k = 2 and n = 4.

Fig. 3. Hybrid partition pipeline, k=2,n=4

As we did in PBS method

Tdn

Trn = max{Trendering , Trn tn comm , Tcomm + Tblending } 1 1 ≈ max{Trendering , Pxy Tc + L + Pxy Tb } k 2k 2 1 1 Ttn = max{ Pxy Tc + L, Pxy Tb , Pxy Tc + L} k k k 2 1 = max{ Pxy Tc + L, Pxy Tb } k k is the same as in equal(15), and Tb Tc . Thus

(22) (23) (24) (25)

2 T (HP P ) ≈ max{Trendering , Pxy Tc + L, Pxy Tc + 2kL} (26) k When k increases, the communication time of Trn and Ttn will be reduced. Equation (26) gets better performance than PBS method. But the communication time of Tdn is still the same. And as we discussed in section 4.1, this is the best we can do in this model. Of course, Tdn can be improved using other approaches, as many researchers have done a lot of works before.

296

4.4

W. Fang et al.

Summary

Better than PDS, PBS and HPP have no link contention, and only need a few network function calls in every step. Even when rendering nodes increase, the number of communicated pixels and network function calls are still stay constancy. This is a huge advantage in large-scale parallel rendering. However, this is based on the assumption that we have enough nodes. If limited nodes can be used, Trendering may be larger than communication time, and our methods could be worse than DS or BS, because our pipeline methods need some nodes to be transfer nodes and less rendering nodes could cause render time even longer. Let Ttotal be the time of one node renders total graphic primitives. Let N be the number of nodes can be used and deﬁne function g(N : M ) be the number of rendering nodes using method M. Namely, g(N : P DS, m) = N − m, Ttotal Ttotal N + 1. If g(N + 2Pxy Tc , g(N : P BS) = N2+2 and g(N : HP P, k) = 2k :M) N this means we lack of nodes to render images, Direct Send or Binary Swap is a better choice than our pipeline methods in this situation.

5

Experiment Results

The application program renders polygon-data, written in C using OpenGL. We conducted our experiments on a PC-cluster of 64 nodes. Each node has dualXeon 3.4GHz CPU and 8GB of memory. All the nodes are connected by 1G Ethernet. "DS" "BS" "PDS" "PBS" "HPP"

28 26 24

fps in final screen

22 20 18 16 14 12 10 8 6 4 2 0 2

4

6

8

12 14 16

28 32

64

number of nodes

Fig. 4. The fps performance of DS,BS,PDS,PBS and HPP

Figure 4 shows the DS, BS, PDS, PBS and HPP methods’ performance of rendering 1,201,287,716 polygons with 800 × 500 screen resolution. The y axis represents the f ps in ﬁnal display, and the x axis represents the number of nodes, including transfer nodes in PDS, PBS and HPP methods. We adopted n = m for

Eﬃcient Pipelining Methods for Image Compositing in Sort-Last Rendering

297

PDS method. When the number of rendering nodes is small, all the methods’s fps are less than 6, because rendering polygons dominates the whole time. All the methods achieved the highest fps when the number of rendering nodes is 16, and the performance dropped while the number of rendering nodes increasing, because more rendering nodes means more images need to be composited. HPP method has the highest f ps than other methods, and needs the most auxiliary nodes correspondingly. It achieved 26 fps using 16 rendering nodes and 12 transfer nodes. Table 1. The scalability with the number of rendering nodes

PDS(1:1) PDS(1:2) PBS HPP

2 20.9 24.6 -

number 4 18.7 20.1 16.3 27.2

of rendering nodes 8 16 10.2 12.3 24.5 21.4 14.5 14.5 26.7 24.0

32 10.2 12.0 -

Table 1 shows our methods’ scalability with the number of rendering nodes. Each rendering node renders a small set of polygons to make sure the bottleneck is network transformation. PDS(1:a) represents n = am in PDS method. The HPP have the best scalability, when the number of rendering nodes changing 4 to 16, the f ps just lost 10%. PDS method’s behavior was better than we expected, but was unstable both in PDS(1:1) and PDS(1:2). PBS also shows nice scalability, and have better f ps performance than PDS(1:1). Table 2. Performance predictions and measurements for diﬀerent screen resolution

T(s) 800x500 31.2 1000x800 15.6 1280x1024 9.5

PDS(1:1) M(s) P(%) 10.2 33 6.6 42 1.7 18

PBS M(s) P(%) 12.0 38 8.0 51 5.5 58

HPP M(s) P(%) 24.0 77 12.8 82 7.9 85

The inter-nodes connection of our PC-cluster has a bandwidth about 100MBs. In order to test our methods’ scalability for rendering diﬀerent screen size, we run our pipelining methods using as many nodes as we can to test f ps performance of rendering 800 × 500, 1000 × 800 and 1024 × 1080 screen size. That is to say, PDS(1:1) use 32 rendering nodes and 32 transfer nodes, PBS use 32 rendering nodes and 30 transfer nodes and HPP use 16 rendering nodes and 48 transfer nodes. We calculated the upper bound of f ps and measured real f ps of our pipeline method for diﬀerent screen size. The theoretical results(T = Pxy Tc ) and experimental results(M ) as well as the percentage(P = T /M ) comparisons are showed in table 1. When screen size increase, the f ps of PBS and HPP method are getting closer to theoretical results, that means we use the system more eﬀective, proﬁting from simple topology structure and a few communication mates for each node. The f ps of PDS method, on the other hand, dropped signiﬁcantly when screen size increase, due to the globe networking operation.

298

6

W. Fang et al.

Conclusion

In this paper, we proposed three tree-based pipeline algorithms for image compositing, which reduce the cost of sort-last parallel rendering. We analyzed the theoretical performance and measured the real performance of our methods, and compared with Direct Send and Binary Swap. The experiments shows that using transfer nodes to reduce communication messages and to avoid link contention, the PBS and HPP method have better behaviors than other methods. Moreover, these methods are highly scalable because when the number of rendering nodes increase, every node still have constant number of communication mates.

Acknowledgement This work is supported by Chinese Universities Scientiﬁc Fund (CUSF), National Natural Science Foundation of China (No.60873210) and National High Technology Research and Development Program of China (No.2009AA01A134).

References 1. Molnar, S., Cox, M., Ellsworth, D., Fuchs, H.: A Sorting Classiﬁcation of Parallel Rendering. IEEE Computer Graphics and Applications 14, 23–32 (1994) 2. Neumann, U.: Communication Costs for Parallel Volume-Rendering Algorithms. IEEE Computer Graphics and Applications 14, 49–58 (1994) 3. Shaw, C., Green, M., Schaeﬀer, J.: A VLSI architecture for image composition. In: Advances in Computer Graphics Hardware III, pp. 183–199. Springer, New York (1991) 4. Ma, K., Painter, J., Hansen, C., Krogh, M.: Parallel Volume Rendering Using Binary-Swap Compositing. IEEE Computer Graphics and Applications 14, 59–67 (1994) 5. Lee, T., Raghavendra, C., Nicholas, J.: Image Composition Schemes for Sort-Last Polygon Rendering on 2D Mesh Multicomputers. IEEE Transactions on Visualization and Computer Graphics 2, 202–217 (1996) 6. Cavin, X., Mion, C., Filbois, A.: COTS Cluster-Based Sort-Last Rendering: Performance Evaluation and Pipelined Implementation. In: Proceedings of IEEE Visualization 2005, pp. 111–118. IEEE Press, New York (2005) 7. Garcia, A., Shen, H.: An Interleaved Parallel Volume Renderer with PC-clusters. In: Proceedings of Eurographics Workshop on Parallel Graphics and Visualization, pp. 51–59. Eurographics Association, Aire-la-Ville (2002) 8. Stompel, A., Ma, K., Lum, K., Ahrens, J., Patchett, J.: SLIC: Scheduled Linear Image Compositing for Parallel Volume Rendering. In: Proceedings of IEEE Symposium on Parallel and Large-Data Visualization and Graphics, pp. 33–40. IEEE Press, New York (2003) 9. Takeuchi, A., Ino, F., Hagihara, K.: An improved binary-swap compositing for sort-last parallel rendering on distributed memory multiprocessors. Parallel Computing 29, 1745–1762 (2003) 10. Yu, H., Wang, C., Ma, K.: Massively parallel volume rendering using 2-3 swap image compositing. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1–11. IEEE Press, New York (2008)

memCUDA: Map Device Memory to Host Memory on GPGPU Platform* Hai Jin, Bo Li, Ran Zheng, Qin Zhang, and Wenbing Ao Services Computing Technology and System Lab Cluster and Grid Computing Lab School of Computer Science and Technology Huazhong University of Science and Technology, Wuhan, 430074, China [email protected]

Abstract. The Compute Unified Device Architecture (CUDA) programming environment from NVIDIA is a milestone towards making programming many-core GPUs more flexible to programmers. However, there are still many challenges for programmers when using CUDA. One is how to deal with GPU device memory, and data transfer between host memory and GPU device memory explicitly. In this study, source-to-source compiling and runtime library technologies are used to implement an experimental programming system based on CUDA, called memCUDA, which can automatically map GPU device memory to host memory. With some pragma directive language, programmer can directly use host memory in CUDA kernel functions, during which the tedious and error-prone data transfer and device memory management are shielded from programmer. The performance is also improved with some near-optimal technologies. Experiment results show that memCUDA programs can get similar effect with well-optimized CUDA programs with more compact source code. Keywords: GPU, CUDA, memory mapping, programming model.

1 Introduction In the past decade, the era of multi-core computer processor is coming due to power and performance limitations of VLSI design. As a new computing platform, GPGPUs become more and more attractive because they offer extensive resources even for non-visual, general-purpose computations: massive parallelism, high memory bandwidth, and general-purpose instruction sets. It is also important to revisit parallel programming model and tools for general-purpose computing on GPU. Programming on GPU is so complex that it is a significant burden for developers. NVIDIA released their new product GTX architecture GPU with CUDA support in 2007 [1], which is a flagship tool chain and has become a de facto standard towards the utilization of massively parallel computing power of Nvidia GPUs. * This work is supported by National 973 Basic Research Program of China under grant No.2007CB310900, the Ministry of Education-Intel Information Technology special research fund (No.MOE-INTEL-10-05) and National Natural Science Foundation of China (No.60803006). C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 299–313, 2010. © IFIP International Federation for Information Processing 2010

300

H. Jin et al.

CUDA allows the programmer writing device code in C functions called kernels. Compared to a regular function, a kernel is executed by many GPU threads in a Single-Instruction-Multiple-Thread (SIMT) fashion. Each thread executes the entire kernel once. Launching a kernel for GPU execution is similar to calling the kernel function. Local references in a kernel function are automatically allocated in registers (or local memory). References in device memories must be created and managed explicitly through CUDA runtime API. The data needed by a kernel must be transferred from host main memory into device memory before the kernel is launched, and result data also needs to be transferred back to host main memory after kernel execution. Note that these data transfer between host memory and device memories are both performed in an explicit manner. For CPU-GPU heterogeneous computing architecture, the address spaces of host CPU and device GPU are separate with each other. The developer has to use CUDA APIs to manage device memory explicitly to realize the CPU accessing GPU data, including transfer data between a CPU and a GPU. Thus, it is very difficult to manage data transfer between host memory and various components of device memory, and manually optimize the utilization of device memory. The programmer has to make decisions on what data to move to/from device memory, when and how to move them. Practical experience shows that the programmer needs to make significant tedious and error-prone code changes. The explicit data transfer and communication between CPU and GPU has become one of the biggest obstacles when programming on GPU. This paper introduces a scheme that could improve the comprehensive performance of the system. The present CUDA only supports data-parallel and not task-parallel, thus, the GPU is always exclusively used by one kernel. In fact, CUDA does not even allow independent kernels from the same process to run concurrently. When the kernel is flying on GPU, there are few high-efficiency mechanisms to support any kinds of communication between CPU and GPU. Due to the currently available NVIDIA GTX architecture GPU does not support task-level parallel (the next generation NVIDIA GPGPU Fermi [18] is claimed to support task-level parallel), the whole kernel invocation procedure must strictly pass through three following phrases: a) performing input data transfer from host memory to device memories (global, constant and texture memory); b) invocating the kernel to process the data; c) performing result data transfer from device memories to host memory. Generally, there are another two stages running on CPU, preprocessing or post-processing stages. As a trivial example, Fig. 1 illustrates the implementation of CUDA-based vector addition. The above mentioned data transfers between host memory and device memory in CUDA program is performed explicitly. It not only increases development time, but also makes the program difficult to understand and maintain. Moreover, the management and optimization on data allocation in GPU device memories involves heavy manipulation of communication, which can easily go wrong, prolonging program development and debugging. With these challenges in mind, we design memCUDA, a high-level directive-based language for CUDA programming, which can automatically map device memory to host main memory. Just using several pragma directives, the developer can focus on the

memCUDA: Map Device Memory to Host Memory on GPGPU Platform

301

usage of host main memory instead of device memory. Therefore, it shields device memory from programmer and relieves the programmer’s burden out of manually data transfer between host memory and device memory. Therefore, it supports the same programming paradigm familiar with CUDA. We have implemented a source-to-source compiler prototype that translates a memCUDA program to the corresponding CUDA program. Experiments with six standard CUDA benchmarks show that the simplicity and flexibility of memCUDA are provided and come at no expense to performance. //Comput vector sum c = a + b __global__ void vecAdd (float *a, float *b, float *c) { Int I = threadIdx.x + blockDim.x * blockIdx.x; c[i] = a[i] + b[i]; } int main() { /* allocat h_a, h_b, and h_c with size N, And initialize host(CPU) memory */ float *h_a = ... , *h_b = ... , *h_c = ...; //allocat device (GPU) memory. float *d_a, *d_b, *d_c; cudaMalloc( (void **) &d_a, N*sizeof(float) ); cudaMalloc( (void **) &d_b, N*sizeof(float) ); cudaMalloc( (void **) &d_c, N*sizeof(float) ); //copy host memory to device memory cudaMemcpy( d_a, h_a, N*sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy( d_b, h_b, N*sizeof(float), cudaMemcpyHostToDevice); //execute the kernel on N/256 blocks of 256 threads each vecAdd <<< N/256, 256>>> (d_a, d_b, d_c); //copy the result data from device back to host memory cudaMemcpy( h_c, d_c, N*sizeof(float), cudaMemcpyDeviceToHost); cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); }

Fig. 1. CUDA vector addition example codelet

The remainder of this paper is organized as follows. In section 2, we review some related works. In section 3, we give a briefly introduction of the main framework of memCUDA. Section 4 introduces memCUDA language directives in details. Section 5 gives a description of memCUDA runtime supports. The sixth section, the memCUDA implementation details are introduced and we conduct some experiments to validate the performance of memCUDA and also give concluding remarks and research directions for future work in section 7.

2 Related Work To make GPU programming more user-friendly, there has been lots of efforts to improve the development of new programming frameworks, among which RapidMind [2] is a well-known one and has been commercialized. Existing general purpose programming languages for the GPU are based on the stream processing mode. These languages include Brook [11][14], Sh [17], and etc. There are also some works using the source-to-source technology to improve the programmability or performance [15][16].

302

H. Jin et al.

BSGP (Bulk Synchronous GPU Programming) [6], a new programming language for general-purpose computation on GPU, is easy to read, write, and maintain. BSGP is based on BSP (Bulk Synchronous Parallel) model. Multicore-CUDA (MCUDA) [12] is a system that efficiently maps the CUDA programming model to a multicore CPU architecture. They use the source-to-source translation process that converts CUDA code into standard C language that interfaces to a runtime library for parallel execution. CUDA-lite [4] is a memory optimization tool for CUDA programs, which is also designed as a source-to-source translator. With some annotations, the compiler automatically transforms the flat program to CUDA code with the utilization of shared memory. This tool takes a CUDA program only with global memory, and optimizes its memory performance via access coalescing. This work puts the shield the shared memory from programmer. Our work could be considered as a supplement of CUDA-lite. CUDA-lite aims at automatic memory optimization, but it still requires the programmer to write a full-fledged data movement between host memory and device memory function in program. In contrast, memCUDA aims at shielding data movements and device management for programmer. A directive-based language called hiCUDA [7] is defined for programming NVIDIA GPUs. The language provides programmers high-level abstractions to carry out the tasks mentioned above in a simple manner, and directly to the original sequential code. The use of hiCUDA directives makes it easier to experiment with different ways of identifying and extracting GPU computation, and managing device memory. The syntax of memCUDA directives bears some similarity to hiCUDA directives. Our work partly is inspired by hiCUDA. However, there are significant differences between the two studies. hiCUDA adopts source-to-source compiling technology like macro replacement without any optimization. Although data movement APIs of CUDA are warped by hiCUDA directives, it also needs programmer consider data movement direction and explicit data movement. Our work focuses on memory mapping from device memory to main memory. The device is completely shielded from programmer. Moreover, the asynchronous execution optimization is also performed in memCUDA.

3 Main Framework Fig. 2 shows the software architecture of memCUDA. At application level, memCUDA provides APIs and pragma directives to programmers. With the help of the API to explicitly express the usage of host main memory, programmers are alleviated from difficult and explicit data transfer and device memory usage, instead of focusing on performance tuning. Similar to OpenMP programming grammar, memCUDA API is built on top of C/C++ so that it can be easily adopted. Below the application layer is memCUDA system layer, which consists of a source-to-source transformer, source code generator and a number of runtime libraries. The source-to-source transformer translates memCUDA pragma directives into native CUDA codes with runtime library APIs. At the same time, it also decides the

memCUDA: Map Device Memory to Host Memory on GPGPU Platform

303

near-optimal technology, which includes utilizing the page-locked memory to achieve the overlap between kernel execution and data transfer through an adaptive algorithm. The lower NVIDIA NVCC is invoked to compile the transformed CUDA code to produce the binary code, which could run on the GPGPU platform. The runtime libraries in Fig. 2 are implemented as wrapper calls to the libraries, which is used to dynamic memory mapping.

Fig. 2. Software framework of memCUDA

Current implementation of memCUDA is built on top of NVIDIA CUDA compiler NVCC for GPU. Instead of directly generating CPU and GPU native machine codes, memCUDA compiler generates CUDA source codes from memCUDA codes on the fly and then uses the system compilers to generate the final machine codes.

4 memCUDA Pragma Directives The memCUDA directives are specified using the pragma mechanism provided by C and C++ standards. Each directive starts with #pragma memcuda and is case-sensitive. Preprocessing tokens following #pragma memcuda are subject to macro replacement with the runtime libraries and original CUDA APIs. In naive CUDA programming framework, device memory is typically allocated with cudaMalloc() and freed with cudaFree(). Data movements between host memory and device memory are typically done with cudaMemcpy() and some other APIs. These APIs are all abandoned in memCUDA system, instead, they are all invoked implicitly by wrapped as memCUDA APIs. Through the simple pragma directives, memCUDA high-level compilers perform a source-to-source conversion of a memCUDA-based program to an original CUDA program. Currently, memCUDA does not support the mapping of texture and constant memory, and also does not support CUDA array structure direct mapping. memCUDA current provides four directives: map, remap, unmap, and update for global memory mapping, just as following: • map directive performs the functions of establishing the mapping relationship. It indicates which references need to be mapped from device memory to host memory.

304

H. Jin et al.

The size of data block also should be given. The shape of the references can eventually be directly derived from source code by compiler parser. Then device memory will be allocated. The new record of the mapping also will be inserted into Memory Mapping Table (mmTable for short, which will be introduced in the next section). In some cases the same data block in host memory needs to be mapped to several counterparts in device memory. Thereby the map directive does not perform the data transfer, which will be performed by the update directive. • remap directive indicates refreshing the mapping relationship. In some cases, especially for the applications with multiple kernels, the old device memory block needs to be freed and substituted by a new block of device memory. So the remap operation refreshes the mapping relationship. • unmap directive is for annotating to dissolute the mapping relationship. It will invoke the operation that free data block of device memory. The mapping record in mmTable will also be deleted. • update directive triggers data transfer operation between host memory and device memory. The transfer direction is decided by STATE attribute of mmTable. For rigorous programmer, the update directive could be combined with map or remap directive. After the mapping relationship installed, the operation which performed by update directive could be executed automatically. That means the host memory block must be initialized before the mapping relationship installed. For the flexibility of programming, we keep the update directive independency of data transfer from the map and remap directive. Fig. 3 shows the current form of the pragma directives in memCUDA. Each directive manages the lifecycle of mapping relationship between data block in device memory and the correspondent data block in host memory. Since data management in global memory always happens before or after kernel execution, the pragma directives must be placed in host code regions. In contrast, the kernel region does not need to be modified. All the pragma directives are stand-alone, and the associated actions taken by runtime libraries happen at the place of the directive.

Fig. 3. memCUDA pragma directives

For comparison purpose, the original hand-written CUDA version of TPACF in Parboil benchmark suite [5] and memCUDA rewritten version both are shown in Fig. 4. For naive CUDA code (see Fig. 4(a)), it needs programmer to allocate device memory and move data from host memory to device memory through invoking CUDA APIs. However, in memCUDA version (see Fig. 4(b)), there are no references in device address space. After using the map pragma, device memory allocation and the mapping relationship all are undertaken automatically. Thus, the kernel can use host memory directly. When data requires moving between host memory and device memory, the

memCUDA: Map Device Memory to Host Memory on GPGPU Platform

305

update pragma needs to be inserted in the rational place in host code region. Then data movements are all performed by runtime libraries and the movement direction (e.g. from host memory to device memory, or vice versa) determined in the parsing phase of the compiler. Above all, memCUDA shields device memory and never exposes it to programmer.

(a) Original CUDA Code

(b) memCUDA Code

Fig. 4. Original CUDA code vs memCUDA TPACF code

From Fig. 4, we can see that memCUDA code is much simpler to write, understand and maintain. The programmer does not need to manage GPU device memory nor use explicit data movement between device memory and host memory. Nonetheless, memCUDA supports the same programming paradigm familiar to CUDA programmers.

5 Runtime Support Runtime support is in charge of memory mapping when the program is running. The lifecycle of the memory mapping could be described through a state transition diagram as shown in Fig. 5. • Mapped state: In this state, the mapping relationship has been established. The data block in host memory and device memory both are allocated. The row which records this mapping relationship is also inserted into mmTable. • Transferred state: In this state, the data has been transferred from host/device memory to device/host memory. The transfer direction is determined by the value of STATE attribution in mmTable. • Destroyed state: This state means the mapping relationship is terminated. Responding to these states, the mapping tasks include: Map(), Update(), Remap(), and Unmap(). Especially, the Update() could perform two directions’ data transfer between host memory and device memory. The following of this section describes memCUDA runtime libraries.

) p( ma Un

)

H. Jin et al.

Un ma p(

306

Fig. 5. State transition diagram for data mapping

5.1 Mapping Table Structure Currently CUDA framework does not support concurrent executions of two kernels, and the size of host memory generally is much larger than device memory, memory mapping is designed as one-to-one mapping between host address space and device address space, shown in Fig. 6. The mapping information is maintained by a structure under a table shape, basically used for looking up and translating operations: mmTable, which holds information for optimizing the look up mechanism and implementing the mapping from GPU device memory address space to host memory address space. A row contains following four attributes for mapping mechanisms: • BASE_H: the base address of data block in host memory address space, which is the key attribute of the table; • BASE_D: the base address of the correspondent data block in device memory address space; • SIZE: the size of data block; • STATE: the state of current mapping operation, which is used to determine the next data movement direction from host memory to device memory, or vice versa. mmTable

r1

...

_d_r2

r2

device memory

host memory mmTable Items

_d_r1

BASE_H

BASE_D

SIZE

STATE

Fig. 6. Illustration of memory-mapping mechanism

memCUDA: Map Device Memory to Host Memory on GPGPU Platform

307

5.2 Memory Mapping The implementation of mapping mechanism is divided into two phases. The first phase takes place before launching the kernel to setup the mapping relationship; the second one occurs inside the mapping runtime system when the kernel flying. For the first phase, it requires some coordination with the compiler supported by source-to-source code replacement. When the mapping operation pragma is invoked, a data block will be allocated in device memory and the size is the same with its counterpart in host memory. The data will be moved from host main memory to device memory. After that, the mapping relationship will be established. Then a new row will be inserted into mmTable to record this mapping relationship, and the attributes of the base address of data block in host device memory address space, the base address of the correspondent data block in device memory address space, and the size of data block and the state of current mapping operation will all be evaluated. The pseudo code is described in Algorithm 1.

For the second phase, when a host memory reference is adopted as an input parameter of a GPU kernel, the implementation will keep track about the base address stored in mmTable. Each time a new instance of a memory reference occurs in kernel region, the implementation checks out the base address and size of the corresponding reference in device memory address space from mmTable. The host memory reference will be replaced by the device memory reference which is checked out from mmTable. It is under the compiler responsibility to assign an entry in mmTable for each memory reference in the code. mmTable is updated at the end of mapping process. The row assigned to memory reference the mapping operation was treating is appropriately filled: base address of data block in host main memory, base address of the corresponding block data in GPU device memory， the size of data block and the state of data block.

308

H. Jin et al.

5.3 Asynchronous Concurrent Execution Kernel invocation in CUDA is asynchronous, so the driver will return control to the application as soon as it has launched the kernel. At the same time, asynchronous copies are allowed if the used host memory is allocated as page-locked memory, which the page will never be swapped out of the memory. In this case, the GPU computation and data transfer between page-locked memory and device memory can be overlapped to improve the overall performance of the program. However, page-locked host memory is never be swapped out, it is a scarce resource and allocations as page-locked memory may start failing long before allocations in page-able memory. In addition, by reducing the amount of physical memory available to the operating system for paging, allocating too much page-locked memory will reduce overall system performance. In memCUDA, we also consider to adopt this property of page-locked memory to improve the performance. memCUDA will adaptively allocate page-locked memory instead of page-able memory even if programmer does not use it explicitly. memCUDA will transfer the program into CUDA code with asynchronous concurrent execution support, when meeting the following conditions: 1. There are multiple kernels in the application. Only in this case, the former kernel’s data transfer can be overlapped with the next kernel execution. 2. There is no data dependence among the kernels. Currently, data flow dependence analysis is an armature technique and there are lots of tools could achieve this goal, including GCC. In memCUDA system, the data dependence analysis is based on related function module including the Cetus [3]. 3. The amount of device memory that kernels require is less half of host memory, the reason has been mentioned in last paragraph.

6 Implementation and Performance Evaluation We implement a prototype system to translate input memCUDA programs to equivalent CUDA programs under the Cetus source-to-source compilation framework [3]. This allows the use of existing CUDA compiler tool chain from NVIDIA to generate binaries. Fig. 7 shows the compilation flow. Our work focuses on memCUDA Extensions Handler module and the Running Libs module. First, we extend the naive CUDA grammar rules with slight modifications to the IR and preprocessor to accept ANSI C with language extensions of CUDA and memCUDA pragma directives. Then ANTLR[13], which is used as an internal C language parser by Cetus, scans and parses the memCUDA source code to establish a Cetus Intermediate Representations (IR for short) syntax tree. Second, memCUDA Extensions Handler module is in charge of source-to-source transformation and data flow dependence analysis based on the prior Cetus IR tree. It transforms memCUDA IR syntax tree to naive CUDA IR tree. memCUDA-specific directives will be replaced by naive CUDA and runtime libraries APIs. At the same time, some optimization manners will also be imposed in the phase such as the asynchronous current execution through data flow analysis. So the output of this phase is a modified Cetus IR tree against the original one. Then the code generator module will print out the source code through traversing the final IR tree. At last, NVCC compiler will be called to compile the source code into binary code.

memCUDA: Map Device Memory to Host Memory on GPGPU Platform

309

Fig. 7. The compilation workflow of memCUDA system

The experiments are conducted on a NVIDIA GeForce GTX 260+ GPU device. The device has 896MB of DRAM and has 27 multiprocessors (MIMD units) clocked at 576MHz. Each multiprocessor has 8 streaming processors running at twice the clock frequency of the multiprocessor and has 16KB of shared memory per multiprocessor. We use CUDA 2.1 for the experiments. The CUDA code is compiled with NVIDIA CUDA Compiler (NVCC) to generate device code launched from the CPU (host). The CPU is an Intel Core 2 Quad 9550 at 2.83GHz with 12MB L2 cache. The size of memory is 4GB. The GPU device is connected to CPU through a 16-x PCI Express bus. The host programs are compiled using Intel C Compiler 10.1 at -O3 optimization level. We present six applications, listed in Table 1, as the benchmarks: MM, MRI-q, MRI-FHD, CP, PNS, and TPACF, the running example of this paper. Except MM is extracted from CUDA SDK 2.1, the other five applications1 are selected from Parboil benchmark suite [5]. For the limitation of the space, we do not explain the detail configure parameters of each benchmark. The interested reader could refer to the web site [5]. The input dataset and parameters setting are all selected from the standard sample datasets of the Parboil benchmark suite. For the matrix multiply benchmark, we write our own memCUDA version and obtained CUDA version from NVIDIA CUDA SDK [7]. For the other five benchmarks, we obtain the sequential version and CUDA version from the Parboil benchmark suite. In that benchmark suite, CUDA version is heavily optimized [9][10]. We rewrite the memCUDA versions and use memCUDA to compile them. We compare the memCUDA version’s performance against the original version from Parboil suite. In Fig. 8, we can see that there are no noticeable performance differences between original CUDA version and memCUDA version. Generally, extra overhead is less than 5% comparing with the original version. This means memCUDA does not introduce much performance penalties. The performance loss arises from the extra operations on the mapping between host memory and devices memory. However, the tiny overhead does not beyond the boundary that we could afford. 1

Currently, memCUDA only realizes the global memory mapping, and does not support the texture, constant memory and CUDA array structure mapping. So the SAD and RPES in the benchmark suite are not used in our experiments.

310

H. Jin et al. Table 1. CUDA benchmarks for evaluating the memCUDA compiler

Kernel Matrix Multiplication (MM)

Configuration Vary by dimensions of matrix

Magnetic Resonance Imaging Q (MRI-Q)

Large size; Small size

Magnetic Resonance Imaging FHD (MRI-FHD)

Large size; Small size

Coulombic Potential (CP)

40000atoms, Vary by grid size 2000×2000 Scale of Petri net; Trajectory 4 Default configure

Petri Net Simulation (PNS)

1.1

Original memCUDA

57

Execution Time(sec.)

Execution Time(msec.)

Two Point Angular Correlation Function (TPACF)

54 51 12 9 6

Original memCUDA

1.0 0.9 0.2 0.1

3

0.0

0 16x16

64x64

256x256

1024x1024 2048x2048

small

Scale

(b) MRI-Q

Original memCUDA

Execution Time(sec.)

Execution Time(sec.)

(a) MM 1.10 1.05 1.00 0.95 0.90 0.15 0.10 0.05 0.00

small

large Scale

large

0.96 0.94 0.92 0.90 0.88 0.86 0.84 0.2 0.1 0.0 Original

memCUDA

Scale

(c) MRI-FHD

(d) TPACF

Fig. 8. Performance comparison of memCUDA version over CUDA version

Second, we also evaluate the efficiency of asynchronous execution optimization. Both CP and PNS repeatedly invoke kernel functions in a for-loop structure, so we conduct the experiment on CP and PNS. The others have only one kernel to execute, so the asynchronous execution optimization is not valid to them. The experimental results are shown in Fig. 9. The original version is naive CUDA version from Parboil benchmark suite. memCUDA label presents the memCUDA version without asynchronous execution

memCUDA: Map Device Memory to Host Memory on GPGPU Platform

311

optimizations. Optimized version is the one that automatically performs the asynchronous execution optimizations by memCUDA. In Fig. 9, we can see the optimized version is superior the prior two versions obviously. In most cases, the performance boosts about 30% comparing to the other two versions.

1.4

Original memCUDA Optimization

Execution Time(sec.)

Execution Time(sec.)

1.6

1.2 1.0 0.8 0.6 0.4 0.2 0.0

128x128

256x256

512x512 1024x1024

Scale

(a) CP

2.0

Original memCUDA Optimization

1.5 1.0 0.5 0.0

100

1000

10000

Scale

(b) PNS

Fig. 9. Performance comparison of original CUDA version, memCUDA version and optimized memCUDA version

7 Conclusions and Future Work In this paper, we present a study to reduce programmers’ burden of data movement between host memory and device memory under GPU CUDA programming environment. memCUDA high-level language is implemented with the composition of source-to-source compiling and runtime libraries technologies. memCUDA produces code with performance comparable to hand-optimized version programs. The coding of memCUDA is lower than the same transformations by hand and a layer of abstraction is provided from the definition of warps in CUDA. Since memCUDA does not handle parallelizing aspects of GPU programming, as the memory optimizing module of an eventual overall framework, memCUDA will facilitate GPGPU programming to encompass parallelization and resource usage decisions to maximize performance. The latest CUDA 2.3 supports a zero-copy mechanism, which could avoid allocating a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel. It seems that main idea of zero-copy is same as ours. However, there is a tough constrain for zero-copy mechanism. Host block used for mapping the device memory must be the page-locked memory (which will never be swapped out by OS). In fact, the GPU driver always uses DMA (Direct Memory Access) from its internally pinned memory buffer when copying data from the host memory to the GPU global memory. The up limited size of available page-locked memory is the half of main memory. So lots of applications could not use the zero-copy mechanisms to get a performance improve. Unlike the hardware mapping supported by current CUDA devices and software, memCUDA uses

312

H. Jin et al.

implicit copy operations inserted by its source-to-source compiler to maintain consistency between the memory spaces. This avoids the problem of needing to pin system pages. Our currently ongoing and future work are followings: a) extend memCUDA to leverage constant and texture memory mapping; b) use some classic compiler optimization to automatically optimize transformation performance, such as adaptive loop unrolling to achieve more efficiency asynchronous execution when a kernel invoked in for-loop structure; c) simplify the directives in memCUDA, some of which can be replaced by compiler analyses.

References 1. NVIDIA. NVIDIA CUDA, http://www.NVIDIA.com/cuda 2. McCool, M.D., Wadleigh, K., Henderson, B., Lin, H.-Y.: Performance Evaluation of GPUs Using the RapidMind Development Platform. In: Proceedings of the ACM/IEEE Conference on Supercomputing (2006) 3. Lee, S.-I., Johnson, T., Eigenmann, R.: Cetus - an extensible compiler infrastructure for source-to-source transformation. In: Proceedings of the International Workshop on Languages and Compilers for Parallel Computing (2003) 4. Ueng, S.-Z., Lathara, M., Baghsorkhi, S.S., Hwu, W.-M.W.: CUDA-lite: Reducing GPU programming complexity. In: Amaral, J.N. (ed.) LCPC 2008. LNCS, vol. 5335, pp. 1–15. Springer, Heidelberg (2008) 5. IMPACT Research Group. The Parboil benchmark suite (2007), http://www.crhc.uiuc.edu/IMPACT/parboil.php 6. Hou, Q., Zhou, K., Guo, B.: BSGP: bulk-synchronous GPU programming. ACM Transaction on Graphics 27(3) (2008) 7. Han, T.D., Abdelrahman, T.S.: hiCUDA: a high-level directive-based language for GPU programming. In: Proceedings of the Second Workshop on General Purpose Processing on Graphics Processing Units (2009) 8. NVIDIA, http://developer.download.nvidia.com/compute/cuda/2_2/ toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf 9. Ryoo, S., Rodrigues, C.I., Baghsorkhi, S.S., Stone, S.S., Kirk, D.B., Hwu, W.-M.W.: Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In: Proceedings of the 15th ACM SIGPLAN Principles and Practice of Parallel Computing (2008) 10. Ryoo, S., Rodrigues, C.I., Stone, S.S., Baghsorkhi, S.S., Ueng, S.-Z., Stratton, J.A., Hwu, W.-M.W.: Program optimization space pruning for a multithreaded GPU. In: Proceedings of the International Symposium on Code Generation and Optimization (2008) 11. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., Hanrahan, P.: Brook for GPUs: stream computing on graphics hardware. ACM Transaction on Graphics 23(3), 777–786 (2004) 12. Stratton, J.A., Stone, S.S., Hwu, W.-M.W.: MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs. In: Amaral, J.N. (ed.) LCPC 2008. LNCS, vol. 5335, pp. 16–30. Springer, Heidelberg (2008) 13. ANTLR, http://www.antlr.org/

memCUDA: Map Device Memory to Host Memory on GPGPU Platform

313

14. Liao, S.-W., Du, Z., Wu, G., Lueh, G.-Y.: Data and computation transformations for Brook streaming applications on multiprocessors. In: Proceedings of the 4th International Symposium on Code Generation and Optimization (2006) 15. Baskaran, M.M., Bondhugula, U., Krishnamoorthy, S., Ramanujam, J., Rountev, A., Sadayappan, P.: A Compiler Framework for Optimization of Affine Loop Nests for GPGPU. In: Proceedings of the 22nd Annual International Conference on Supercomputing (2008) 16. Lee, S., Min, S.-J., Eigenmann, R.: OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2009) 17. Sh, A.: High-Level Metaprogramming Language for Modern GPUs (2004), http://libsh.sourceforge.net 18. NVIDIA, http://www.nvidia.com/object/fermi_architecture.html

Adaptive Line Size Cache for Irregular References on Cell Multicore Processor Qian Cao, Chongchong Zhao*, Junxiu Chen, Yunxing Zhang, and Yi Chen University of Science and Technology Beijing, 100083 Beijing, China [email protected]

Abstract. Software cache promises to achieve programmability on Cell processor. However, irregular references couldn’t achieve a considerable performance improvement since the cache line is always set to a specific size. In this paper, we propose an adaptive cache line prefetching strategy which continuously adjusts cache line size during application execution. Therefore, the transferred data is decreased significantly. Moreover, a corresponding software cache adaptive line size cache is designed. It introduces a hybrid Tag Entry Arrays, with each mapping to a different line size. It’s a hierarchical design in that the misshandler is not invoked immediately when an address is a miss in the short line Tag Entry Array. Instead, the long line Tag Entry Array is checked first, which significantly increases the hit rate. Evaluations indicate that improvement due to the adaptive cache line strategy translates into 3.29 to 5.73 speedups compared to the traditional software cache approach. Keywords: Adaptive, Software cache, Irregular reference, Cell processor.

1 Introduction Irregular application is widely used in scientific computing, which exposes unclear aliasing and data dependence information. Such applications are frequently seen in reservoir numerical simulation, molecular dynamics, etc. Heterogeneous multicore is an area and energy efficient architecture to improve performance for domain-specific applications. The Cell processor is a representative heterogeneous multicore, which comprises a conventional Power Processor Element (PPE) that controls eight simple Synergistic Processing Elements (SPEs), as illustrated in Figure 1. PPE has two levels of cache that are coherent with the globally memory, while SPEs don’t have cache but each has 256KB of local store. PPE can access main memory directly while SPE only operates directly on its local store and works as an accelerator. Software cache is a common approach to automatically handle data transfers for irregular reference, providing the user with a transparent view of the memory architecture. *

Corresponding author.

C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 314–328, 2010. © IFIP International Federation for Information Processing 2010

Adaptive Line Size Cache for Irregular References on Cell Multicore Processor

315

There has been substantial research [1-6] on software cache specifically for Cell processor. Eichenberger et al. [1] propose a compiler-controlled software cache. It’s a traditional 4-way set-associative cache implemented in software. It adopts the LRU policy and SIMD mode to look up for a match among the four tags in a set.

PPE L1 cache

512K B L 2 cache

Interrupt controller

RA M SPE0 LS

SPE1 LS

SPE2 LS

SPE 3 LS

M IC

RA M

Elem ent Interconnect Bus (EIB) SPE 4 LS

SPE 5 LS

SPE 6 LS

SPE7 LS

IO BEI

IO

Fig. 1. Cell Architecture

Balart et al. [2] demonstrate a software cache for Cell which adopts hashed list for lookup and allows for full associative. This design enables a user to determine code regions guaranteed not to have any cache conflicts. In such a region, the user can reorder lookup and misshandler operations, so communication and computation can be efficiently overlapped. The strategy usually performs well for specific loops containing few cache accesses with high temporal locality, but it introduces a large implementation overhead for the general applications. The COMIC runtime system proposed by Lee et al. [3] provides the application with an illusion of a shared memory, in which the PPE and the SPEs can access the shared data. The management of synchronization and coherence is centralized in the PPE and the release consistency is achieved by software cache. A hybrid access-specific software cache is presented by Marc Gonzalez et al. [4, 5]. It classifies memory accesses into high locality and irregular, and the corresponding high locality cache and the transactional cache are designed. The former applies the write-back mechanism while the latter supports the write-through policy. Its motivation is similar to the direct buffer plus software cache approach. Chen et al. [6] propose an integrated software cache and direct buffer approach so as to efficiently execute the loops that include both references. Their solution provides compile time analysis and runtime support to minimize the coherency operations. The software caches usually suffer from poor performance, especially when the irregular reference is encountered. The solutions above always set the cache line to a specific size, which introduces the reduction of data transfers and increases the memory bandwidth overhead. The cache design with adaptive line size could obviously improve the irregular application performance. There are some proposals [7, 8] for the hardware adaptive cache line solutions. But the SPE on Cell has no hardware cache, so we focus on adaptive cache line designs implemented in software. To the best of our knowledge, the adaptive cache line scheme proposed by Sangmin Seo et al. [9] is the only strategy which continuously adjusts the cache line on Cell processor. Their design is called extend set-index cache (ESC), which is based on

316

Q. Cao et al.

4-way set-associative cache. Nevertheless, the number of TEs could be greater than the number of cache lines. The adaptive strategy in ESC utilizes the runtime to adapt to characteristics specific to the loop considering that the loops are invoked many times. But their strategy is applied to parallel loops only, and it isn’t sensitive to variation across iteration of the loop. Additionally, its storage overhead is large. In this paper, we propose an adaptive cache line size strategy, which adaptively adjusts cache line size according to the characteristics specific to the irregular reference. The solution gathers the addresses accessed by the irregular reference and divides them into long line addresses (long addresses) and short line addresses (short addresses). The algorithm adaptively chooses the optimal cache line size, regardless of how many times the loop is invoked. Moreover, a corresponding software cache design - adaptive line size cache (ALSC) is presented. It is based on the 4-way set-associative cache (4WC) and adopts a hybrid Tag Entry Arrays, a long line Tag Entry Array and a short line Tag Entry Array, with each mapping to a different line size. The operations to the long line Tag Entry Array is the same as the traditional 4WC, but when a miss occurs in the short line Tag Entry Array, the misshandler is not invoked at once. Instead, the long line Tag Entry Array is checked. So the miss rate is significantly decreased. In order to implement cache replacement policy for cache design with multiple line sizes, we present a novel LRU policy - IndAlign_LRU. It adopts a link array, with each link mapping to one set in the long line Tag Entry Array and two successive sets in the short line Tag Entry Array. The data field of the link node stores the cache line index. IndAlign_LRU policy is implemented by moving nodes to the link head or tail. The experimental results show that our approach obtains speedup factors from 3.29 to 5.73 compared to the traditional software cache scheme with specific line size. Moreover, it significantly reduces the miss rate and the total transferred data size. Additionally, the adaptive approach we proposed shows good scalability. The rest of the paper is organized as follows. The adaptive software cache line algorithm is presented in Section 2. Section 3 describes the ALSC design. The ALSC operational model is presented in Section 4. Section 5 evaluates our adaptive approach. The last section concludes the paper.

2 The Adaptive Software Cache Line Algorithm The adaptive cache line prefetching scheme, which is based on our previous work [15], consists of four steps. A loop with normalized boundaries is extracted from CG in the NAS benchmark suite for a clear explanation, as illustrated in Figure 2a. And the adaptive strategy is shown in Figure 2b. The first step is to initialize the cache lines. For the sake of simplicity, two cache lines, 128B and 256B, are introduced. The second step is to divide the addresses into long and short addresses. Our adaptive algorithm is applied to each iteration range. We propose a dynamic address collecting solution, which means the address collecting is stopped when the first set conflict is encountered.

Adaptive Line Size Cache for Irregular References on Cell Multicore Processor

317

for(k=0; k
/*step1: initialization*/ lb_tmp=0; Longln=256;//size of long line Shortln=128;//size of short line do{ /*step2: the addresses dividing process*/ ub_tmp = collect_dynamic(lb_tmp, ub_tmp); for(k=lb_tmp; k
To explain it clearly, an example is shown in Figure 3. We assume the following: •

•

•

There are seven memory requests with addresses ranging from a1 to a7 in an iteration range and they are all mapped to the addresses in the range of 256N (256(N+2)-1)). The data located in addresses from 256N to (256(N+1)-1) is represented Li, and the former 128B is represented by Li-1 while the latter 128B is expressed as Li-2. Li-1 and Li-2 are “adjacent” lines. The data located in addresses from 256(N+1) to (256(N+2)-1) is represented Lj, and the former 128B is represented by Lj-1 while the latter 128B is expressed as Lj-2. Lj-1 and Lj-2 are “adjacent” lines.

All the seven addresses are taken as short addresses initially, as illustrated in Figure 3a. The addresses a1 and a3 are mapped to the “adjacent” lines. If a short line is fetched from the memory, two DMA operations which respectively fetch the Li-1 and

318

Q. Cao et al.

the Li-2 are required. So our adaptive strategy makes the two short lines Li-1 and the Li-2 merged into a long line Li. And the addresses a1 and a3 are taken as long addresses. Obviously, the data required can be obtained in one DMA. Analogically, addresses a2, a4, a5 are regarded as long addresses. The data located in addresses a6 and a7 is in Lj-2, and there’s no memory requirement in its “adjacent” line Lj-1, so the address a6 and a7 are both taken as short line addresses. 2 5 6 B a lig n e d b o u n d a r y : 1 2 8 B a lig n e d b o u n d a r y :

s h o r t lin e a d d r e s s : lo n g lin e a d d re s s : ad d ress 2 5 6 N

…

256N +128

a1

a2

… a3

L i- 1

2 5 6 (N + 1 )

256N +384

…

…

a4 a5

L i- 2

2 5 6 (N + 2 )

a6

L j-1

a7

…

a7

…

…

L j- 2

Li

Lj a . b e fo re th e a d d re s se s m e rg in g

256N

…

256N +128

a1

a2

… a3

2 5 6 (N + 1 )

a4 a5

256N +384

…

…

a6

2 5 6 (N + 2 )

…

b . a f t e r th e a d d r e s s e s m e r g i n g

Fig. 3. Example for addresses merging

The third step is to adaptively prefetch the long and short lines from lower bound lb_tmp up to upper bound ub_tmp. The last step is the computation loop, which performs the computation.

3 The Adaptive Line Size Cache Structure We design a software cache which corresponds to the adaptive cache line algorithm. For the sake of simplicity, we describe the cache including only two kinds of lines, 128B and 256B. It’s composed of the following structures, as depicted in Figure 4. The Cache Storage is set to be 64KB in this paper. C a c h e P a r a m e te r 1 lin e s iz e 1 m a s k 1 h ash1 G lo b a l a d d re s s

ea

s ta tic m a p p in g TE TE TE TE … …

S1

T a g E n tr y A r r a y 1

o ffs e t TE TE TE TE

h ash2 S2 lin e s iz e 2 m a s k 2 C a c h e P a r a m e te r2 T a g E n tr y A r r a y 2

in d e x e s 0 0 D 1 2 1 D 3 4 5 128B 6 N 7 0 1 2 3 … … 1 D lin e N 1 7

1 0

D D

…

…

…

… …

lin e lin e 0 l i n e1 lin e 2

D 1 C a c h e S to r a g e V _ d ir 1 D 1 V _ d ir 2 D 2

Fig. 4. The ALSC structure

Adaptive Line Size Cache for Irregular References on Cell Multicore Processor

319

The Cache Parameter1 includes two components, the mask1 and the line size1 (L1). L1 is set to be 256B, so the number of the long line (n_L) is 256 (64KB/256B). The Tag Entry Array1 is a long address tag lookup table which is composed of S1 (S1 = n_L/4 = 64) sets. Each TE statically maps to a 256B long line. The Cache Parameter2 is similar to the Cache Parameter1. The line size2 (L2) is set to be 128B and the number of the short line (n_S) is 512. In Tag Entry Array2, each tag statically maps to a short line. Every set in the Tag Entry Array1 maps to 8 successive short cache lines. The short lines are orderly numbered 0, 1, 2, …, 7, 0, 1, 2, …, 7, which are the line indexes. In order to implement LRU replacement policy among multiple line sizes, we extend the traditional LRU replacement policy and present IndAlign_LRU, which adopts an Index Link Array. Its initial information and the mapping relationship between the Index Link Array and the cache lines are illustrated in Figure 5.

7

lin e 5 1 1

…

2 … 7 ^

0

head

1 2 … 7 ^

s ta tic m a p p in g

ta il 0 1 2 …

…

… 0 1 2

… lin e 5 0 4 lin e 5 0 5 lin e 5 0 6

…

7

… lin e 1 5

1

…

…

7 0 1 2

0

…

… lin e 7 lin e 8 lin e 9 lin e 1 0

In d e x L in k A r r a y lin k [2 ] lin k [6 4 ] …

…

lin e 0 lin e 1 lin e 2

lin k [1 ]

…

0 1 2 …

lin e in d e x e s C a c h e S to r a g e

7 ^

Fig. 5. The mapping relationship between the Index Link Array and cache lines

The Index Link Array is a link array which has S1 links. Each link is mapped to one set in the Tag Entry Array1 and two successive sets in the Tag Entry Array2. Each link contains 8 nodes, with each node having a data field to store the line index. The data filed of link head stores the index of the cache line which is the earliest accessed while the data field of the link tail stores the index of the line which is the latest accessed. The cache line activity is recorded by moving nodes to the head or the tail.

320

Q. Cao et al.

The V_dir1 and V_dir2 record the valid bits and the D1 and D2 store the dirty bytes. The cache architecture we propose is a hierarchy design. Figure 6 shows set masks of the long line address and short line address. Both the numbers of set in Tag Entry Array1 and Tag Entry Array2 are powers of 2, so a bit-wise AND operation instead of the hash function is used to improve performance, as shown in (1). (1)

SetID = (ea & SetMask) >> N_bits

Where SetID is the number of the set, SetMask is the set mask and 2N_bits equals the corresponding cache line size. Because the number of the set in Tag Entry Array1 is 64 and the line size1 is 256B, the SetID in Tag Entry Array1 is decided by the successive 6 bits, ranging from 8th to the 13th bit. Correspondingly, the SetID in Tag Entry Array2 is decided by the bits from 7th to the 13th. When the cache receives a memory request with a global address ea, if it’s a miss in the Tag Entry Array2, it may be a hit in the Tag Entry Array1. Since the number of the bits which decide the SetID is one bit less.

0 31 0

0

0

1

1

1

0

0

0

15 14 13 12 11 10

9

8

7

6

1

0

0

1

1

0

0

0

0

0

1

1

1

1

1

1

1

1

Fig. 6. Set masks of the long line address and short line address

4 The ALSC Operational Model The simple ALSC operational flowchart is shown in Figure. 7. long addresses

Lookup_long

hit?

yes no

IndAlign_LRU_long

Misshandler_long

Adaptive_ line short addresses

Lookup_short

hit? yes IndAlign_LRU_short no Lookup_long

hit?

yes no

Misshandler_ short Fig. 7. The ALSC operational flowchart

Adaptive Line Size Cache for Irregular References on Cell Multicore Processor

321

4.1 Lookup_long, Lookup_short, IndAlign_LRU_long, IndAlign_LRU_short Both Lookup_long and Lookup_short are the same as the traditional 4WC. IndAlign_LRU_long and IndAlign_LRU_short are invoked respectively when the long address hit and short address hit occur. Suppose a hit of the long address is encountered and the matching set and way is set_L and hit_index_L respectively, the set_Lth link nodes whose data fields are (2*hit_index_L) and (2*hit_index_L+1) moves to the link tail. This operation is called IndAlign_LRU_long. Figure 8a illustrates the operation when hit_index_L is 0. Suppose a short line address comes subsequently and the matching set and way is set_S and hit_index_S respectively, the set_Lth set in the Tag Entry Array1 and the set_Sth set in the Tag Entry Array2 are both mapped to the set_Lth link if the equation (set_L*2=set_S) or (set_L*2 +1=set_S) is satisfied. If the former or the latter is satisfied, the node whose data field is hit_index_S or (hit_index_S+4) moves to the link tail, respectively. The two cases are illustrated in Figure 8b and Figure 8c respectively. The operations are called IndAlign_LRU_short.

Initial index link

head 0 o ld 1

2 3 4 5 6 new ta il 7

h it_ in d e x _ L = 0

a . lo n g a d d r e s s h it

2 3 4 5 6 7 0 1

h it_ in d e x _ S = 3 && s e t_ L * 2 = s e t_ S

b . s h o rt a d d re s s h it a n d s e t_ S is an even num ber

2 4 5 6 7 0 1 3

h it_ in d e x _ S = 3 && s e t_ L * 2 + 1 = s e t_ S

2 4 5 6 0 1 3 7

s e ra c h th e o ld e s t a lig n e d in d e x e s

4 5 6 0 1 7 2 3

c . s h o rt a d d re s s h it a n d s e t_ S is d . lo n g a d d re s s m is s an odd num ber

Fig. 8. Examples for the operations in ALSC

4.2 Misshandler_long For a long address, the function IndAlign(index_L) is defined as follows. If index_L is an even number, IndAlign(index_L) is defined as IndAlign(index_L) = index_L And (index_L +1).

(2)

If index_L is an odd number, IndAlign(index_L) is defined as IndAlign(index_L) = (index_L -1) And index_L.

(3)

When a long address which is mapped to the set_Lth set arrives subsequently and a miss occurs, Misshandler_long is invoked, which includes the following phases:

322

Q. Cao et al.

Choosing the oldest lines to be the victim. We first get the oldest indexes, index_old from the set_Lth link head. The two successive short lines whose indexes are IndAlign(index_old) (as defined in (2)(3)) are chosen to be the victims. Writing back the dirty bytes and setting the corresponding V_dir bits. Fetching the long line required from the main memory and filling the corresponding DE. Moreover, the nodes whose data fields are IndAlign(index_old) moves to the tail of the set_Lth link, as shown in Figure 8d. 4.3 Misshandler_short For a short address which is mapped to the set_Sth set, IndAlign(index_S) is defined as follows. If set_S is an even number, IndAlign(index_S) is defined as IndAlign(index_S) = oldest(0, 1, 2, 3).

(4)

If set_S is an odd number, IndAlign(index_S) is defined as IndAlign(index_S) = oldest(4, 5, 6, 7).

(5)

The oldest(ind1, ind2, ind3, ind4) means the index of the line which is the earliest accessed, with the four parameters in parenthesis denoting the line indexes. If a short address, which is mapped to the set_Sth (set_S /2 = set_L) set in the Tag Entry Array2 arrives subsequently and a miss occurs, the Tag Entry Array1 is searched first. If there’s a valid matching, the operation is similar to IndAlign_LRU_short. Otherwise, Misshandler_short is called. A simplified ALSC design is shown in Figure 9. We assume the address is mapped to the set_Lth link.

lin k in d e x

0 1 s e t_ L 2 T a g E n tr y A r r a y 1 3 4 5 s e t_ S 6 T a g E n tr y A r r a y 2 7

s ta tic m a p p in g

L [0 ] L [1 ] …

…

V _ d ir 2 [ 0 ] V _ d ir 2 [ 1 ] …

L [4 ] V _ d ir 2 [ 5 ] L [5 ] L [6 ] … … V _ d ir 2 [ 7 ] L [7 ] c a c h e s to r a g e V _ d ir 2

Fig. 9. A simplified ALSC structure

V _ d ir1 [0 ] V _ d ir1 [1 ] V _ d ir1 [2 ] V _ d ir1 [3 ]

V _ d ir 1

Adaptive Line Size Cache for Irregular References on Cell Multicore Processor

323

Choosing the oldest index (index_old) from the set_Lth link according to (4) or (5) after judging the parity of set_S. We assume the index_old is 5. Checking the V_dir arrays and writing back the dirty bytes. The V_dir1[2], which is mapped by the index_old should be checked first. If it’s valid, the dirty bytes in L[4] and L[5] should be written back. Meanwhile, the V_dir1[2] is set to be 0. Otherwise, if it’s invalid, the V_dir2[5] is checked. If it’s valid, the dirty data in L[5] is written back. Fetching the required short line from the memory, filling the corresponding DE, and setting V_dir2[5] to valid. Finally, the node whose data field is 5 moves to the link tail.

5 Evaluation 5.1 Evaluation Environments The experiment is conducted on a Cell BE blade [10] with two Cell processors running at 3.2GHz with 1GB of system memory. In this experiment, the programs are bound to one Cell processor to avoid the NUMA effect. The performance is measured with Sparse matrix-vector (SpMV) multiplication and IS, CG, FT, MG from NAS parallel benchmarks [11]. The sparse matrix epb1.mtx is a 14734*14734 symmetric matrix, which is obtained from the University of Florida Sparse Matrix Collection [12]. The benchmarks IS, CG, FT, MG are tested with CLASS A. The sequential regions are executed on the PPE while the iterations in the parallel loops are distributed among the available SPEs. The system runs Fedora9 (Linux Kernel 2.6.25-14). Our programs are compiled in the Cell SDK3.1. 5.2 Execution Speed In this section, we evaluate four software cache configurations. The first one is a traditional software cache design implementing a 4-way set-associative cache. This cache design is with 64KB storage and 128-byte cache lines. It’s referred to TRADITIONAL. The second cache configuration is ESC [9]. It is an alternative to implement the adaptive software cache line on Cell processor, so we compare it with our scheme. The last configuration is our adaptive software line size cache, which adopts a 64KB cache storage. We refer to this configuration as ADAPTIVE. The last two configurations adaptively choose a cache line size among 128B, 256B, 512B and 1024B during application execution. Figure 10 illustrates the normalized speedup of different applications, and the baseline is the execution speed of TRADITIONAL. On the whole, our adaptive cache line strategy combined with the optimized cache structure performs better than the other two cache designs.

324

Q. Cao et al.

We first compare our adaptive cache line solution with the ESC. Obviously, our adaptive cache solution achieves noticeable performance improvements. It mainly results from the following factors: First, the adaptive algorithm in ESC is applied to the parallel loop only, but it isn’t sensitive to variation across iterations of the loop. Our adaptive approach could be applied to not only the parallel loop but also the iteration ranges, so it could choose the optimal cache line size more precisely. 8

TRADITIONAL

Speedup

7

ESC

ADAPTIVE

6 5 4 3 2 1 0 FT-A

IS-A

CG-A

MG-A

SpMV

Fig. 10. Normalized speedup

Second, the algorithm in the ESC sometimes might choose a cache line size which is not the optimal. An example is given in Figure 11. There are five different cache line sizes, LS0 to LS4, with the size increasing. Suppose that their performance levels are 1, 3, 2, 4, 0, with each higher level corresponding to the better performance. Obviously, LS3 is the optimal line size. The adaptive execution in ESC is as follows. s h o rte s t

lo n g e s t

c a c h e lin e s iz e

LS0

LS1

LS2

p e rfo rm a n c e le v e l

1

3

2

T P I3

T P I2

T P I1

TPI

LS3

LS4

4

0

(o p tim a l) (w o rs t)

T P I4

T P I5

Fig. 11. An example of the adaptive algorithm in ESC

When the loop is invoked the first time, it’s executed with the initial line size, LS2. And the time per iteration (TPI) is TPI1. When the loop is invoked the next time, it executed with a shorter line, LS1. The corresponding TPI is TPI2. TPI1 must be greater than TPI2 according to the performance level, so a shorter line LS0 is chosen and the corresponding TPI3 is measured in the next loop invocation. Obviously, TPI3 is greater than TPI2, so LS1 is chosen to be the optimal. Nevertheless, LS3 is the optimal line size according to the performance level. This results from the following two reasons.

Adaptive Line Size Cache for Irregular References on Cell Multicore Processor • •

325

Though TPI1 is greater than TPI2 in the above algorithm, it’s not necessary the case that a shorter line performs better for the loop, as shown in Figure 11. The adaptive strategy in ESC depends on many factors, for example, the initial cache lines size, choosing which line (a longer or a shorter line) in the second invocation of the loop.

Third, the loop has to be invoked many times to compare TPIs in the ESC strategy, which greatly degrades the performance. Additionally, a certain loop may be invoked once in some applications, so the adaptive algorithm in ESC doesn’t work well in such cases. The loop in SpMV is invoked only once, and not surprisingly, the ESC has only a slight performance improvement compared with the traditional approach. Then we compare the ADAPTIVE with the TRADITIONAL configuration. We achieve speedup factors of 3.29 for FT, 4.53 for IS, 5.73 for CG, 4.01 for MG and 4.12 for SpMV. The benchmarks which are sensitive to the cache line size and are dominated by irregular references benefit more from the ADAPTIVE configuration. CG is such an application, so it achieves a significant performance improvement. Though IS isn’t sensitive to the cache line size, it is dominated by irregular memory references and thus it exposes a high miss rate. Not surprisingly, it obtains a significant speedup from our adaptive prefetching scheme. 5.3 Transferred Data Size

Transferred data size

If the transferred data size of the cache design with 128B is the baseline, the transferred data size of cache line design with 128B, 256B, 512B and 1024B is illustrated in Figure 12.

4

1024B

512B

256B

128B

ADAPTIVE

3 2 1 0

FT-A

IS-A

CG-A

MG-A

SpMV

Fig. 12. Normalized transferred data size

Obviously, the transferred data of the ALSC is less than the other cache designs, especially compared with the 1024B line design. Because even though only a few bytes are needed, the whole 1024B line is transferred if the cache line is set to be 1024B. The ALSC outperforms designs with the fixed short cache line because the latter immediately transfers data when a miss occurs. Nevertheless, the ALSC first checks the long line Tag Entry Array. If it’s a hit, there is no need to transfer data.

326

Q. Cao et al.

5.4 Scalability Figure 13 presents the scalability of our approach. All the benchmarks, except IS, show good scalability from 1 to 8 threads, with speedup more than 6 on 8 SPEs. The main reason is that every thread executes in an exclusive SPE. IS doesn’t scale up well because it contains some computations in either master execution or critical codes. Those computations are executed sequentially. Therefore, its speedup of 8 SPEs is only about 3.8. 9

IS

8

MG

FT

CG

SpMV

Speedup

7 6 5 4 3 2 1 1SPE

2SPEs

4SPEs

8SPEs

Fig. 13. Scalability of the adaptive approach on Cell processor

5.5 Storage Overhead Table 1 depicts the extra storage overhead except the storage for N cache line. FAC and 4WC denotes the fully associative and 4-way set-associative cache, respectively. To formally describe the extra storage overhead we assume as followed: • • • •

The number of the short line is N, and the number of set in 4WC is S (S=N/4). S’ is the number of sets in ESC, which is four times the closest power of 2 to S. The tags are 4-byte integers while dirty and valid bits are one byte. All the caches introduce fetch buffers. And every TE needs a line index. The FAC and the ESC both need a field to record the global address. Table 1. Extra storage overhead (in bytes)

Heading level TE TE Array LTE Cache line table Total Size

Example tag+line index =8 TE*N=8N V+D+tag =6 LTE*N= 6N 14N

4WC tag+line index =8 (4*TE)* S=8N V+D=2 LTE*N =2N 10N

ESC 10N (4*TE)* S’=32S’ V+D+tag =6 LTE*N= 6N 32S’+6N

ALSC tag+line index =8 (4*TE)*3S/2=12N (V+D)+(V+D)/2 =3 LTE*N=3N 15N

Suppose that the whole cache size is 64K, the total storage overhead is listed in Table 2. The ALSC storage overhead is a little more than that of the FAC and 4WC and much less than that of the ESC.

Adaptive Line Size Cache for Irregular References on Cell Multicore Processor

327

Table 2. Extra storage overhead with a 64KB cache (in bytes)

Line size 128B 512B 1KB 4KB

Number of lines 512 128 64 16

FAC 7168 1792 896 224

4WC 5120 1280 640 160

ESC 19456 4864 2432 608

ALSC 7680 1920 960 240

6 Conclusions and Future Works We present an algorithm which adaptively adjusts the software cache line for irregular reference on Cell processor. Moreover, a corresponding software cache design is proposed, which significantly improves the hit rate and decreases the reduction of data transfers. The evaluation results indicate that our strategy achieves the speedup factor from 3.29 to 5.73 compared with the traditional software cache approach. Additionally, the adaptive strategy shows good scalability. Acknowledgments. The research is partially supported by the Hi-Tech Research and Development Program (863) of China under Grant No. 2008AA01Z109, the Key Project of Chinese Ministry of Education under Grant No. 108008, and by the National Key Technology R&D Program under Grant No. 2006BAK11B00.

References 1. Eichenberger, A.J., O’Brien, J.K., O’Brien, K.M., Wu, P., et al.: Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture. J. IBM Sys. 45, 59–84 (2006) 2. Balart, J., Gonzalez, M., Martorell, X., Ayguade, E., et al.: A Novel Asynchronous Software Cache Implementation for the Cell BE Processor. In: Adve, V., Garzarán, M.J., Petersen, P. (eds.) LCPC 2007. LNCS, vol. 5234, pp. 125–140. Springer, Heidelberg (2008) 3. Lee, J., Seo, S., Kim, C., Kim, J., et al.: COMIC: A Coherent Shared Memory Interface for Cell BE. In: 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 303–314. ACM Press, New York (2008) 4. Marc, G., Nikola, V., Xavier, M., Eduard, A., et al.: Hybrid access-specific software cache techniques for the cell be architecture. In: 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 292–302. ACM, New York (2008) 5. Vujić, N., Gonzàlez, M., Martorell, X., Ayguadé, E.: Automatic Pre-Fetch and Modulo Scheduling Transformations for the Cell BE Architecture. In: Amaral, J.N. (ed.) LCPC 2008. LNCS, vol. 5335, pp. 31–46. Springer, Heidelberg (2008) 6. Chen, T., Lin, H.B., Zhang, T.: Orchestrating data transfer for the cell b.e. processor. In: 22nd Annual International Conference on Supercomputing, pp. 289–298. ACM Press, New York (2008) 7. Dubnicki, C., LeBlanc, T.: Adjustable block size coherent caches. In: 19th Annual International Symposium on Computer Architecture, pp. 170–180. ACM Press, New York (1992) 8. Veidenbaum, A.V., Tang, W., Gupta, R., Nicolau, A., et al.: Adapting Cache Line Size to Application Behavior. In: 13th International Conference on Supercomputing, pp. 145–154. ACM Press, New York (1999)

328

Q. Cao et al.

9. Seo, S., Lee, J., Sura, Z.: Design and Implementation of Software-Managed Caches for Multicores with Local Memory. In: 15th International Symposium on High-Performance Computer Architecture, pp. 55–66. IEEE Press, New York (2009) 10. Altevogt, P., Boettiger, H., Kiss, T., Krnjajic, Z.: IBM BladeCenter QS21 Hardware Performance. IBM Technical White Paper WP101245 (2008) 11. Bailey, D., Harris, T., Saphir, W., Wijngaart, R.V.D., et al.: The NAS Parallel Benchmarks 2.0. NAS Technical Report NAS-95-020, NASA Ames Research Center, Moffett Field, CA (1995) 12. University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices/ 13. Gelado, I., Kelm, J.H., Ryoo, S., Lumetta, S.S., et al.: CUBA: An Architecture for Efficient CPU/Coprocessor Data Communication. In: 22nd Annual International Conference on Supercomputing, pp. 299–308. ACM Press, New York (2008) 14. Chen, T., Zhang, T., Sura, Z., Tallada, M.G.: Prefetching irregular references for software cache on cell. In: 6th Annual International Symposium on Code Generation and Optimization, pp. 155–164. ACM Press, New York (2008) 15. Cao, Q., Zhao, C.C., Zhang, Y.X., Chen, J.X., et al.: Adaptive Tuning of Sparse MatrixVector Multiplication on Cell Architecture. In: 2nd International Conference on Computer Engineering and Technology, pp. 292–296. IEEE Press, New York (2010) 16. Schneider, S., Yeom, J.S., Rose, B., Linford, J.C., et al.: A Comparison of Programming Models for Multiprocessors with Explicitly Managed Memory Hierarchies. In: 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 131–140. ACM, New York (2009) 17. Crawford, C.H., Henning, P., Kistler, M., Wright, C.: Accelerating Computing with the Cell Broadband Engine Processor. In: 5th ACM Conference on Computing Frontiers, pp. 3–12. ACM Press, New York (2008) 18. Wang, Z., O’Boyle, M.: Mapping Parallelism to Multicores: A Machine Learning Based Approach. In: 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 75–84. ACM Press, New York (2009) 19. Jimenez, V.J., Vilanova, L., Gelado, I., Gil, M., et al.: Predictive Runtime Code Scheduling for Heterogeneous Architectures. In: Seznec, A., Emer, J.S., O’Boyle, M.F.P., Martonosi, M., Ungerer, T. (eds.) HiPEAC 2009. LNCS, vol. 5409, pp. 19–33. Springer, Heidelberg (2009) 20. Wang, P., Collins, J.D., Chinya, G., Jiang, H., et al.: Architecture and Programming Environment for a Heterogeneous Multicore Multithreaded System. In: Conference on Programming language Design and Implementation, pp. 156–166. ACM, New York (2007) 21. Ren, M., Park, J., Houston, M., Aiken, A., et al.: A tuning framework for softwaremanaged memory hierarchies. In: 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 280–291. ACM, New York (2008)

Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors Wei Mi2,*, Xiaobing Feng1, Jingling Xue3, and Yaocang Jia1,2 1

Key Laboratory of Computer System and Architecture. Institution of Computing Technology 2 Graduate University of Chinese Academy of Sciences 3 School of Computer Science and Engineering. University of New South Wales {miwei,fxb,jiayaocang}@ict.ac.cn, [email protected]

Abstract. DRAM row buffer conflicts can increase the memory access latency significantly for single-threaded applications. In a chip multiprocessor system, multiple applications competing for DRAM will suffer additional row buffer conflicts due to interthread interference. This paper presents a new hardware and software cooperative DRAM bank partitioning method that combines page coloring and XOR cache mapping to evaluate the benefit potential of reducing interthread interference. Using SPECfp2000 as our benchmarks, our simulation results show that our scheme can boost the performance of the most benchmark combinations tested, with the speedups of up to 13%, 14% and 8.06% observed for two cores (with 16 banks), two cores (with 32 banks) and four cores (with 32 banks). Keywords: Row Buffer Locality, Cache Locality, Address Mapping.

1 Introduction The DRAM memory system is a critical shared resource among multiple cores in a chip multiprocessor system. In a multi-programmed workload, multiple applications (i.e., threads) competing for DRAM will impede each other’s progress due to interthread interference. Accesses from one thread can cause row buffer conflicts, bank conflicts and data/address bus conflicts to accesses from other threads. Therefore, uncontrolled interthread interference can significantly degrade overall system performance. DRAM row buffer conflicts, i.e. row misses occur when a sequence of DRAM accesses to different rows go to the same DRAM bank, causing much higher access latency than row hits. For a single-core system, some bitwise XOR address mapping schemes [1][2] and memory scheduling policies like FR_FCFS (First Ready FirstCome-First-Serve) [3] can be employed by the memory controller to reduce row buffer conflicts. In addition, bitwise XOR cache mapping schemes [4][5] can also be incorporated into the last-level cache to reduce both row buffer conflicts and last-level cache misses [1]. * Supported by the National Basic Research 973 Program of China under Grant No.2005CB321602, and the National Natural Science Foundation of China under Grant No.60736012. C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 329–343, 2010. © IFIP International Federation for Information Processing 2010

330

W. Mi et al.

For a multi-core system, existing techniques for reducing row buffer conflicts appear to be all hardware-based, focusing mostly on improving memory scheduling policies [6][7][8][9][10][11][12]. By leveraging the solutions for single cores [1][2][3], these hardware-based schemes attempt to optimize a multitude of objectives, including row buffer utilization, memory efficiency, DRAM throughput, fairness and QoS. But they suffer from a limited scope (when done purely in hardware) and are constrained by the conflicting nature of multiple objectives. In addition, the row latency reduction techniques through improving row hit rates become less effective due to the increased bank contention [6]. In this paper, we address the problem of reducing intrathread and interthread row buffer conflicts for running multi-programmed workloads on multi-core systems that keep the last-level cache private to each core (e.g., AMD Athlon). We introduce for the first time a static software-hardware cooperative DRAM bank partitioning scheme to reduce both kinds of conflicts, thereby improving overall system performance. By partitioning the DRAM banks among the applications, interthread interference is reduced or controlled in a more deterministic manner. This allows potentially existing techniques [6][7][8][9][10][11][12] to be applied more effectively. We have evaluated our scheme using the SPECfp2000 benchmarks on two- and four-core systems by using the cycle-accurate x86 full system CMP simulator FeS2 [13] (Section 5). Our preliminary results show that our scheme can boost the performance of the most benchmark combinations experimented with, with the speedups of up to 13%, 14% and 8.06% observed for two cores (with 16 banks), two cores (with 32 banks) and four cores (with 32 banks). respectively.

2 Background and Motivation An SDRAM system consists of multiple banks that can be accessed in parallel. Each DRAM bank is a DRAM cell array organized in rows and columns. Each bank contains a row buffer used to cache the data in the most recently accessed row. The row buffer size is usually 2/4KB (i.e., one-half page/one page). The latency of a memory access can vary greatly depending on whether the access is a row (buffer) hit or miss. Normally, a DRAM access for a bank is realized in three stages, precharge, row access and column access [14]. There are two modes for DRAM accesses: open-page and close-page. In the open-page mode, if the next access to the same bank goes to the same row (a row hit), only column access is necessary. If the next access is a row miss (row buffer conflict), however, the precharge does not start until after the request has arrived. The close-page mode allows the precharge to start immediately after the current access. Like the prior work on improving row buffer locality, this work can be beneficially applied when the open-page mode is used. In this case, when an access is a row hit, the data required is already cached in the row buffer. The data can be directly operated on in the row buffer without the precharge and row access operations, reducing the DRAM access time by half or more compared to when an access is a row miss [8][12][14]. For a single-core system, a conventional address mapping scheme allocates consecutive data blocks to consecutive memory banks using a modular mapping function, i.e., memory interleaving. Zhang et al. [1] found that the resulting address mapping

Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors

331

symmetry between the last-level L2 cache and DRAM is a significant source of row buffer conflicts. Such symmetry refers to the fact that the bank index bits are usually part of the L2 cache set index bits. As a result, L2 cache conflicts or writebacks usually lead to row buffer conflicts. They proposed to use an XOR address mapping scheme implemented in the memory controller to break the address mapping symmetry. By XORing the bank index bits and a portion of cache set index bits, as shown in Figure 1(a), the data blocks are permuted (or distributed evenly) across the memory banks. Without the XOR address mapping, the read RD and the L2 cache writeback WB caused by RD, as shown in Figure 1(b), will result in a row miss because they have the same L2 cache set index. In addition to address mapping, memory scheduling policies [3][6][7][8][9][10][11] are also effective in reducing row buffer conflicts by prioritizing row hit accesses over others. Furthermore, the XOR cache mapping schemes [4][5] can also reduce simultaneously both row buffer conflicts and the lastlevel cache misses [1]. This work is the first to apply page coloring in combination with a bitwise XOR cache mapping scheme to manage DRAM bank partitioning to reduce both intrathread and interthread row buffer conflicts while not increasing bank conflicts unduly. Our software-hardware cooperative scheme can improve row buffer locality effectively for a multi-programmed workload and may potentially enable existing techniques [7][8][9][10][11][12] to better optimize other objectives such as fairness and QoS in future.

Fig. 1. Memory controller address mapping

3 DRAM Bank Partitioning Our proposed approach is to use page coloring to map the DRAM accesses from different applications to different banks (if possible) in order to reduce interthread row buffer conflicts. Simultaneously, we also apply a bitwise XOR cache mapping scheme at the last-level, i.e., L2 cache to reduce intrathread row buffer conflicts and cache misses. 3.1 Page-Coloring-Based Bank Partitioning We use the classic OS page-coloring [15] to partition the DRAM banks for a multiprogrammed workload. In the OS page coloring, all physical pages are divided into groups with all pages in the same group being labeled a distinct color. When a new physical page is requested by an application, the OS will allocate a page whose color

332

W. Mi et al.

is in the set of colors assigned to the application. If an application has no colors in common with other applications, and in addition, if different colored pages are mapped to different DRAM banks, then there are no row buffer conflicts among the accesses from different applications. As a result, all interthread row buffer conflicts are avoided. For two applications that share some pages and thus some banks, their interthread row buffer conflicts can be reduced if the number of their shared pages, i.e., banks can be reduced. Figure 2 illustrates the page-coloring-induced bank partitioning for a 16-bank DRAM. The number of colors available is determined by the number of the bank index bits that are not part of the page offset.

Fig. 2. Page coloring for partitioning

Fig. 3. Conventional and XOR address mapping

a 16-bank memory

3.2 Bitwise XOR L2 Cache Mapping Once the banks are partitioned across the applications, interthread row buffer conflicts are reduced. How do we also reduce intrathread row buffer conflicts? How do we also improve cache utilization since the last-level cache is also partitioned unexpected by page coloring? Figure 3 illustrates why conventional and XOR address mapping schemes implemented in the memory controller are not helpful in reducing intrathread row buffer conflicts in the presence of page coloring. As shown in Figure 3(a), the conventional address mapping allocates consecutive data blocks to consecutive DRAM banks so that different colored pages are mapped to different DRAM banks. Its main advantage is that the static bank partitioning results obtained by page coloring are preserved but its main drawback is that the address mapping symmetry problem discovered in [1] is prevalent. On the other hand, as shown in Figure 3(b), the XOR address mapping scheme [1] suffers from the opposite problem: it can break the address mapping symmetry and thus reduce row buffer conflicts significantly, but it reshuffles colored pages to banks and thus disables the partitioning effect done by page coloring.

Fig. 4. Bitwise XOR cache mapping

Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors

333

What we want is a solution that breaks the address mapping symmetry while also preserving the static bank partitioning results achieved by page coloring. When we partition DRAM via page coloring, the private last-level L2 cache in each core is also partitioned at the same time. In this paper, we propose to use a bitwise XOR cache mapping scheme to redistribute the cache accesses from otherwise the restricted part of the L2 cache to the whole cache so as to remove the cache partitioning effect. As studied earlier [1], the XOR cache mapping is capable of not only reducing cache misses but also breaking the address mapping symmetry (as effectively as the XOR address mapping in the memory controller, in general). Figure 4 illustrates the XOR cache mapping scheme used together with page coloring. Some bits of the cache set index and an equal number of lower-order bits in the cache tag are XORed to produce a new cache set index. In addition, the bits thus modified in the cache set index correspond exactly to the bits of the DRAM bank index. In Figure 1, the read RD and the L2 cache writeback WB caused by RD are necessarily mapped to the same bank without the XOR address mapping in the memory controller because their bank indexes are part of their L2 cache set indexes, which happen to be identical. In Figure 4, RD and WB should have the same new cache set index because WB is caused by RD. Therefore, RD and WB must have different cache set indexes and thus different bank indexes before the XOR cache mapping operation is applied. This means that RD and WB are mapped to different banks, giving rise to no row buffer conflict. Therefore, we can use the XOR cache mapping at the lastlevel cache together with the conventional address mapping in the memory controller to substitute for the XOR address mapping employed in the memory controller [1]. This solution has no impact on the mapping of colored pages to the DRAM banks achieved by page coloring. The XOR cache mapping requires several XOR operations to obtain a new L2 cache index. Since all the XOR operations can be done in parallel, the extra delay incurred is one XOR gate. Depending on implementations, the XOR gate may or may not be on the critical path [4][5].

Fig. 5. MRC, SRC and BCC of fma3d for a 32-bank DRAM (with 32 page colors when its row buffer is one page)

4 Cost Model and Partitioning Algorithm We give an algorithm that determines statically which colors are allocated to which application in a multi-programmed workload. An assignment of colors to the applications

334

W. Mi et al.

in a workload dictates the placement of these applications’ data across the DRAM banks. To this end, we must first build a cost model. When we give fewer colors to an application to minimize the number of banks it has in common with other applications, its interthread row buffer conflicts may decrease but its intrathread row buffer conflicts may increase. Therefore, we need to minimize both kinds of conflicts for all applications simultaneously. In addition, we need to make sure that each application has at least a certain number of banks in order to maintain its bank parallelism. As a result, our algorithm will strive to minimize both kinds of conflicts while not increasing bank conflicts, i.e., reducing bank parallelism unnecessarily. 4.1 Cost Model We use a row buffer miss rate curve (MRC) to estimate the intrathread row buffer conflicts of an application. The MRC of an application, which is obtained by profiling, gives its row buffer miss rate as a function of the number of colors assigned to it. The MRC of an application represents its demand for the DRAM bank space. As an example, Figure 5(a) plots the miss rate curve of fma3d. We use a so-called sensitivity rate curve (SRC) to characterize the interthread row buffer conflicts of an application, which is defined in terms of the politeness and robustness associated with that application. The politeness of an application shows to what extent the application affects the row buffer miss rates of other applications. In general, as the L2 cache miss rate of an application increases, its politeness often tends to worsen. So we use empirically the L2 cache miss rate of an application as its politeness. The higher the L2 cache miss rate of an application is, the less its politeness is. The robustness of an application shows to what extent the row buffer miss rate of the application is affected by other applications. The robustness of an application is represented as its row buffer miss rate when run together with its co-runners. The SRC of an application allows us to find its robustness as a function of the (average) politeness of its co-runners treated as a whole. For example, let us compute the SRC for fma3d for a workload consisting of all SPECfp2000 benchmarks. We select five representative applications, vpr, mgrid, equake, fma3d and swim, as a standard application group. Their politeness values are ranked from lowest to highest in that order. We run fma3d with each of these applications as a co-runner, obtain five row buffer miss rate and politeness pairs, and finally, use these five pairs to fit a quadratic polynomial curve to obtain t he SRC of fma3d, which is plotted in Figure 5(b), to represent its robustness. Given the MRC and SRC of each application, we can estimate the overall row buffer miss rate for a multi-programmed workload. Let there be M colors and N applications. The MRC and SRC of application i are denoted by MRCi and SRCi, respectively. Suppose application i gets mi colors with mij common colors with application j in a bank partitioning. The L2 cache miss rate of application i is denoted by cmri. We estimate the Row Buffer Miss (RBM) rate of application i, denoted RBMi, in this bank partitioning as follows: RBM i = MRCi (mi ) +

∑

j∈(1, N ), j ≠ i

(( SRCi (cmrj ) − MRCi ( M )) ∗ mij / mi )

(1)

Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors

①

335

②

where represents the number of intrathread row misses of application i and represents the number of interthread row misses of application i caused when it is run together with all other applications. In , subtracting MRCi(M) (the row buffer miss rate of application i when it gets all banks) from SRCi(cmrj) (the row buffer miss rate of application i when it gets mi banks) gives the number of interthread row misses introduced to application i by application j, after it is scaled proportionally by mij/mi. Presently, we model only the number of overlapping colors, between two applications but do not differentiate exactly how their colors, i.e., data banks are laid out in DRAM). To minimize bank conflicts for an application, we also use a Bank Conflicting Curve (BCC) to ensure that the number of colors an application gets does not drop below a certain threshold. The BCC of an application gives its performance (measured in terms of its IPC) as a function of the number of colors assigned to it in the close-page mode. Therefore, the BCC of an application represents the effect of bank conflicts (without row buffer conflicts) on performance when its color count is varied. We use the BCC of an application to find a bank conflict turning point, a threshold that represents the minimum number of colors that should be allocated to the application to guarantee its bank parallelism. Figure 5(c) plots the BCC of fma3d with its bank conflict turning point being 6 (assuming 32 colors).

②

4.2 Bank Partitioning Algorithm Based on our cost model, we have developed a bank partitioning algorithm for a multi-programmed workload consisting of N co-running applications by assigning each application with some colors from a set of M colors available. Since the search space consisting of all possible color assignments is huge, Figure 6 gives a hillclimbing searching algorithm, ColorMap, for finding a feasible solution. Hill-climbing has a shortcoming of getting stuck easily at a local optimum. We alleviate this problem by populating Candidate_Set with all possible candidates C = (m1, … , mN) such that its i-th element, mi, represents the number of colors assigned to application i, which ranges from the bank conflict turning point given by its BCC to M (line 23). We make use of the following notation to represent the overall row buffer miss rate for all applications for the current color count assignment specified in an Nvector C: RBM (C ) =

∑

1≤ k ≤ N

RBM k

(2)

Note again that our cost model is simple since it ignores the actual colors assigned to a particular application. The while loop in ColorMap processes all candidates in Candidate_Set, one at a time. Given a candidate, which initially indicates only the number of colors assigned to each application (lines 2 and 3), Find_Local_Optimum aims to find an assignment of actual colors to each application. In addition, an application may get more colors than initially indicated in candidate if doing so will improve overall row buffer locality. This is achieved by a hill-climbing process. Initially, every application is initialized with some colors to start with (line 4). Then Find_Local_Optimum chooses an application, selects a new color and assigns it to the application in each hill climbing

336

W. Mi et al.

step (lines 6 – 16). Of all possible choices made in each hill climbing step, the one that minimizes the overall row buffer miss rate of the entire workload is taken (lines 11 -- 13). So the steepest ascent direction is always preferred. If this is not possible (lines 14 and 15), we check to see if we have at least allocated the minimum number of colors to each application as initially indicated in candidate (line 17). If this is the case, we are done. Otherwise, we assign a new color randomly to each application whose number of assigned colors is still less than its minimum specified in candidate (lines 18 and 19). This enables the hill-climbing process to be started again. Finally, Save_Best_Color_Layout reveals the best solution found.

Fig. 6. A DRAM bank partitioning algorithm

5 Evaluation Methodology We evaluate our work using the cycle-accurate x86 full system CMP simulator FeS2 [13], which is based on the Simics virtual machine [16]. The page coloring algorithm is implemented on Linux 2.6.26. The memory system is modeled using DRAMsimv1.2 simulator [17]. Table 1(a) shows the major processor and DRAM parameters. By default, the row buffer is one page of the size 4KB. We use all 14 SPECfp2000 benchmarks for evaluation. Each benchmark is compiled using gcc-4.1.2 with “-O3” optimizations and run for 100 million instructions.

Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors

337

6 Experimental Results We use the three acronyms, CC, CX and XC, to represent three address mapping schemes listed in Table 1(b). We start by applying page coloring to vary the color count of an application and evaluate their row buffer miss rates and L2 cache miss rates with a single core to see which scheme combines the best with page-coloringbased DRAM bank partitioning. (Scheme “XX” cancels out each other’s partitioning effect and is thus omitted.) Then we evaluate our bank partitioning mechanism “XC+Page Coloring” and our partitioning algorithm with a large number of benchmark combinations running on two- and four-core systems. Table 1. Experiment parameters

6.1 CC, CX or XC + Page Coloring on Single Cores We evaluate CC, CX and XC when each combined with page coloring for a 16-bank DRAM for running a benchmark on one core. Since the row buffer is one 4KB page (Table 1(a)), there are 16 colors. Figure 7 and Figure 8 compare their row buffer and L2 cache miss rates as the number of colors that a benchmark gets changes. In Figure 7, we see that CX exhibits lower row buffer miss rates than XC in most of the benchmarks. But the gaps between the two become smaller as the number of colors assigned to a benchmark increases. This is because CX disables the bank partitioning effect of page coloring but XC keeps it. When given the same number of colors, CX can use more banks than XC does by reshuffling (evenly or randomly) the colored pages allocated to a benchmark to all the banks. In addition, we can also see that XC has lower row buffer miss rates than CC for most of the benchmarks, because XC can break the address mapping symmetry but CC cannot. As can be observed in Figure 8, CX exhibits significantly higher L2 cache misses than XC for several benchmarks, such as art, ammp, galgel and mgrid. This is because XC can eliminate the cache partitioning effect of page coloring but CX cannot, implying that XC can ultilize more cache space than CX. When an application gets all 16 colors, .i.e., all the banks, CX

338

W. Mi et al.

and XC have nearly the same row buffer miss rate and L2 cache miss rate, a result also confirmed earlier in [1]. So XC and page coloring represent a good combination for DRAM bank partitioning.

Fig. 7. Row buffer miss rates of CX and XC

Fig. 8. L2 cache miss rates of CX and XC

6.2 XC + Page Coloring on Multiple Cores 6.2.1 Two Cores We evaluate the effectiveness of our software-hardware cooperative bank partitioning mechanism and our partitioning algorithm in improving the overall row buffer locality, L2 cache hit rate and overall system performance when running a pair of applications on a two-core system. Recall that the row buffer is a 4KB page, resulting in a total of 16 colors in total. One color controls the placement of one page in a DRAM bank. We first consider a 16-bank DRAM and then move to a 32-bank DRAM. We consider all 105 pair-wise benchmark combinations and present our results in Figure 9. For all these pairs, the speedups, regardless whether they are positive or negative, correlate well to their row buffer miss rates, indicating the significance of row buffer locality optimization techniques on boosting overall program performance. The speedup values for a total of 105*2=210 benchmark executions are divided into the following six intervals:

Looking at the number of benchmarks falling into (-∞, <-5%) and (>5%, +∞), we find that the benchmarks with the largest speedups significantly outnumber those with the worst slowdowns. Some small performance degradations are observed in some benchmarks because their row buffer and/or L2 cache miss rates are made slightly worse by “XC+Page Coloring”. Our cost model can be crude when estimating the row buffer conflicts for an application when it is run together with its co-runners. For example, when the SRC of an application over-approximates its interthread conflicts with other applications, our partitioning algorithm will usually not allocate enough banks to the application, causing its intrathread conflicts to increase. The L2 cache miss rates for some applications are often difficult to estimate statically. In Figure 9, the benchmarks with decreased and increased L2 cache misses are nearly equally divided (with a ratio 117:92). In addition, the benchmark executions with performance slowdowns of <-1% when their L2 cache miss rates are increased to > 1% total only 7 among all 210 benchmark executions. This seems to suggest that the impact of L2 cache misses on performance is less pronounced than that of row buffer misses.

Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors

339

We have also evaluated the same 105 pairwise workloads for a 32-bank DRAM and plot the results similarly in Figure 10, which displays similar trends as Figure 9 for the same performance metrics evaluated. In Figure 10, the 210 benchmark executions again fall into the following six intervals according to their speedups:

In comparison with the table for 16 banks, there are more benchmarks falling into (1%~5%) (-5%~-1%) but fewer into (0%~1%) (-1~0%). When there are many colors (32 rather than 16 banks), our partitioning algorithm has produced fewer pairs that share equally the 32 banks. There is another reason for the existence of more benchmarks in (-5%~-1%) with 32 banks. The SRC functions of some applications may happen to be less accurate when more banks are available. The two benchmarks with the largest performance slowdowns were caused by this reason. Better performance results are expected when more accurate cost modes and partitioning algorithms are used.

Fig. 9. Results for 105 pairs of benchmarks (16 banks)

Fig. 10. Results for 105 pairs of benchmarks (32 banks)

As for L2 cache misses, their increases or decreases can now impact more on program performance when the corresponding row buffer miss rates do not change much, as shown in Figure 10. As in the 16-bank case, the benchmarks with increased and decreased L2 cache misses are nearly evenly divided (with a ratio of 98:112). The number of benchmark executions with negative speedups of <-1% when their L2 cache miss rates are > 1% is 17 for all 210 benchmark executions. Overall, although

340

W. Mi et al.

some more benchmarks have suffered performance slowdowns, but their negative speedups are small, particularly when compared with the positive speedup cases. 6.2.2 Four Cores A total of 10 benchmark combinations are simulated with and without DRAM bank partitioning. Their partitioning strategies are listed in Table 2. The performance results are plotted in Figure 11. The negative effects of increased L2 cache miss rates on performance in some benchmarks are offset by the benefits obtained from significant row buffer miss rate reductions. For the 10 groups used, the speedups of up to 8.22% are observed and the average speedup across these groups is 2.89%. Table 2. Benchmark combinations on four cores Mix1 Mix2 Mix3 Mix4 Mix5 Mix6 Mix7 Mix8 Mix9 Mix10

Benchmark Groups lucas, applu, facerec, mgrid lucas, art, ammp, facerec ammp, galgel, equake, swim applu, facerec, mgrid, wupwise applu, fma3d, galgel, equake applu, facerec, fma3d, sixtrack facerec, galgel, equake, sixtrack lucas, ammp, facerec, mesa facerec, equake, mgrid, sixtrack applu, mgrid, mesa, sixtrack

Color mask bits 0xdffff83f: 0xcf89001c: 0x20000100: 0x1070fee3 0xefff877f: 0x4007ff00: 0xffffffff: 0x10000080 0xffffffff: 0xef811ff0: 0xb1fee00f: 0xffffffff 0x8e800f4f: 0x40001000: 0x317fe0b0: 0xbfff2fff 0xa1fe0000: 0x4e006f3c: 0xb1ff90c3: 0x5801f3ff 0xa9c06780: 0x40000040: 0xbe7ff83f: 0x11be1fbf 0x80000040: 0x6f907f07: 0x786f80bf: 0x17f87fb8 0xcff9fe7f: 0xffffffff: 0x20020000: 0x10040180 0x80000400: 0x5ff8007f: 0x20078380: 0x3e0ffbf0 0x8e30030f: 0x51cfec30: 0x200010c0: 0x31f01fff

Fig. 11. Results of 10 selected workloads on 4 cores (32 banks) 6.2.3 One-Half Page Row Buffer We have repeated all our two- and four-core simulations for a 32-bank DRAM when the row buffer size is one-half page. In this case, one color determines the place of one physical page across two consecutive DRAM banks. For the two-core case, the speedup distributions can be observed from the following table:

Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors

341

These statistics are similar to those produced for a 32-bank DRAM when the row buffer is one page. For the four-core case, we have simulated the same 10 mixes listed in Table 2. The speedups of up to 9.7% are observed and the average speedup across all the benchmarks is 2.4%. These results demonstrate that the proposed bank partitioning scheme appears to work well for different row buffer sizes (relative to a fixed page size in a system).

7 Related Work None of prior work about DRAM access optimizations consider to reduce row buffer conflicts for Chip Multiprocessor via software. To our knowledge, this paper is the first to propose a software-hardware cooperative DRAM bank partitioning mechanism for reducing interthread and intrathread row buffer conflicts. Rixner et al. [3] examine various DRAM access scheduling policies and propose the FR-FCFS policies. Hur and Lin [18] introduce adaptive history-based scheduling policies to minimize the average DRAM access delay and to balance the ratio between reads and writes from the processor. Shao and Brian [19] describe a burst access scheduling mechanism to maximize the data bus utilization by read preemption and write piggybacking. Lee et al. [20] suggest to use a prefetching-aware DRAM controller to adaptively prioritize between conventional demand and prefetching operations. Zhang et al. [1] expose an important source of row buffer conflicts in single cores and propose a bitwise XOR address mapping scheme to reduce these conflict significantly. These research efforts focus on the problems for single cores but some of the principles proposed such as FR-FCFS scheduling policies and XOR address mapping are also useful for multi- cores. For a multi-core system, DRAM becomes a major shared resource. The memory controller needs to optimize memory performance by considering a variety of factors simultaneously such as row buffer competition, data bus competition, memory efficiency, the fairness and QoS, but its only means is to control the priorities of the accesses from each thread. So the memory controller needs to weight these factors. Nesbit et al. [9] use a network-fairing-queue based scheduler to provide thread fairness and QoS and a First-Ready Virtual Finish Time First policy (FR-VFTF) to balance row buffer utilization and fairness. Rafique et al. [10] use virtual start time fair queuing instead of virtual finish time and improve fairness based on Nesbit’s work. Mutlu and Moscibroda [11] show the stall time of a thread is a more direct indicator of fairness and propose a stall time fair scheduler. These research efforts are mainly concerned with balancing row buffer competition and thread fairness. Zheng et al. [7] considers a scheduler that combines two factors about memory efficiency and the pending request number of each request to improve system throughput. Mutlu and Moscibroda [8] introduce parallelism-aware batch scheduling to provide a substrate for bank parallelism, row buffer utilization, fairness and QoS, and their work cleverly adjust the degree of inclining to any factor via controlling the size of a batch. Due to the complexity of considering so many performance factors, ipek et al. [12] suggest to add a reinforcement learning mechanism to a DRAM access scheduler. In comparison with the prior work, our work is the first to reduce interthread row buffer conflicts via DRAM partitioning through software-hardware cooperation. It represents an orthogonal means to improving row buffer locality for multi-programmed workloads. As a result, our

342

W. Mi et al.

mechanism may enhance existing techniques by enabling them to focus on optimizing other factors.

8 Conclusion In order to reduce DRAM row buffer conflicts for multi-core systems, we present a software-hardware cooperative scheme to realize a static DRAM bank partitioning. We apply page coloring to map different applications to different partitioned physical spaces and propose to use a bitwise XOR cache mapping scheme together with page coloring to reduce both intrathread row buffer conflicts and last-level cache misses. In order to determine a DRAM bank partitioning strategy, we discuss how to build a cost model and develop a heuristics-based algorithm for partitioning the DRAM banks among the applications in a multi-programmed workload. Our simulations demonstrate that DRAM bank partitioning can significantly reduce DRAM row buffer miss rates and achieve speedups for two- and four-core systems. In the future, we plan to study how to tackle the DRAM bank partitioning and shared cache partitioning simultaneously.

References 1. Zhang, Z., Zhu, Z., Zhang, X.: Breaking Address Mapping Symmetry at Multi-levels of Memory Heirarchy to Reduce DRAM Row-buffer Conflicts. The Journal of InstructionLevel Parallelism 3 (October 2001) 2. Lin, W.: Reducing DRAM Latencies with an Integrated Memory Hierarchy Design. In: Proceedings of the 7th International Symposium on High-Performance Computer Architecture HPCA, January 20-24, p. 301. IEEE Computer Society, Washington (2001) 3. Rixner, S., Dally, W.J., Kapasi, U.J., Mattson, P., Owens, J.D.: Memory access scheduling. In: Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA 2000, Vancouver, British Columbia, Canada, pp. 128–138. ACM, New York (2000) 4. Seznec, A.: A case for two-way skewed-associative caches. In: Proceedings of the 20th Annual international Symposium on Computer Architecture, ISCA 1993, San Diego, California, United States, May 16-19, pp. 169–178. ACM, New York (1993) 5. González, A., Valero, M., Topham, N., Parcerisa, J.M.: Eliminating cache conflict misses through XOR-based placement functions. In: Proceedings of the 11th International Conference on Supercomputing, ICS 1997, Vienna, Austria, July 7-11, pp. 76–83. ACM, New York (1997) 6. Zhu, Z., Zhang, Z.: A Performance Comparison of DRAM Memory System Optimizations for SMT Processors. In: Proceedings of the 11th international Symposium on HighPerformance Computer Architecture, HPCA, February 12-16, pp. 213–224. IEEE Computer Society, Washington (2005) 7. Zheng, H., Lin, J., Zhang, Z., Zhu, Z.: Memory Access Scheduling Schemes for Systems with Multi-Core Processors. In: Proceedings of the 2008 37th International Conference on Parallel Processing, ICPP, September 9-11, pp. 406–413. IEEE Computer Society, Washington (2008) 8. Mutlu, O., Moscibroda, T.: Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In: Proceedings of the 35th Annual International Symposium on Computer Architecture, June 21-25, pp. 63–74. IEEE Computer Society, Washington (2008)

Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors

343

9. Nesbit, K.J., Aggarwal, N., Laudon, J., Smith, J.E.: Fair Queuing Memory Systems. In: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, December 9-13, pp. 208–222. IEEE Computer Society, Washington (2006) 10. Rafique, N., Lim, W., Thottethodi, M.: Effective Management of DRAM Bandwidth in Multicore Processors. In: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, PACT, September 15-19, pp. 245–258. IEEE Computer Society, Washington (2007) 11. Mutlu, O., Moscibroda, T.: Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, December 1-5, pp. 146–160. IEEE Computer Society, Washington (2007) 12. Ipek, E., Mutlu, O., Martínez, J.F., Caruana, R.: Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. In: Proceedings of the 35th Annual International Symposium on Computer Architecture, June 21-25, pp. 39–50. IEEE Computer Society, Washington (2008) 13. Neelakantam, N., Blundell, C., Devietti, J., Martin, M.M.K., Zilles, C.: FeS2: A Fullsystem Execution-driven Simulator for x86. In: Proceedings of the 13th International Conference on Architectural Support For Programming Languages and Operating Systems ASPLOS XIII, Seattle, WA, USA, March 1-5 (2008) (poster session) 14. Jacob, B., Ng, S., Wang, D.: Memory Systems: Cache, Dram, Disk. Morgan Kaufmann Publishers Inc., San Francisco (2007) 15. Kessler, R.E., Hill, M.D.: Page placement algorithms for large real-indexed caches. ACM Trans. Comput. Syst. 10(4), 338–359 (1992) 16. https://www.simics.net/ 17. Wang, D., Ganesh, B., Tuaycharoen, N., Baynes, K., Jaleel, A., Jacob, B.: DRAMsim: a memory system simulator. SIGARCH Comput. Archit. News 33(4), 100–107 (2005) 18. Hur, I., Lin, C.: Adaptive History-Based Memory Schedulers. In: Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, Portland, Oregon, December 4-8, pp. 343–354. IEEE Computer Society, Washington (2004) 19. Shao, J., Davis, B.T.: A Burst Scheduling Access Reordering Mechanism. In: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture HPCA, February 10-14, pp. 285–294. IEEE Computer Society, Washington (2007) 20. Lee, C.J., Mutlu, O., Narasiman, V., Patt, Y.N.: Prefetch-Aware DRAM Controllers. In: Proceedings of the 41st Annual IEEE/ACM international Symposium on Microarchitecture, November 8-12, pp. 200–209. IEEE Computer Society, Washington (2008)

Energy-Efficient Scheduling of Real-Time Periodic Tasks in Multicore Systems Xiaodong Wu1, Yuan Lin1, Jian-Jun Han1, and Jean-Luc Gaudiot2 1

College of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei province, China 2 Department of Electrical Engineering and Computer Science, University of California, Irvine, U.S.A. [email protected], [email protected], [email protected], [email protected]

Abstract. Nowadays, energy savings have become one of the most critical issues. In this paper, we propose an energy-efficient approach to scheduling periodic real-time tasks in the multicore context. Within a voltage/frequency domain (VFD), a simple static voltage/frequency scaling schedule (SimpleVS) is first introduced to select the utilization of the heaviest-loaded core as the shared operating frequency of this VFD. Next, the slack reallocation policy is proposed to further reclaim slack times while satisfying timeliness requirements. The slack reallocation strives to redistribute the slack times uniformly to the cores on the same VFD by appropriate job migrations. Experimental results show that compared with the static frequency schedule, our proposed policy can achieve energy savings up to 22 percent when the system is under-utilized. Keywords: Energy-efficient, multicore systems, real-time scheduling.

1 Introduction Nowadays power has become the most critical design constraint. The increasing on-chip power dissipation and the demand of portability make low power consumption one of the primary circuit and system design goals. This not only applies to battery powered devices, but also to desktop computers and high-performance systems. Many solutions have been proposed to mitigate the energy pressure. The processor industry is now shifting towards Chip Multi-Processors (CMP) architectures [1] by integrating several cores onto a single chip to reduce the power consumption. Multicore systems usually can deliver a higher throughput at the cost of lower power consumption than uniprocessor systems. As the demand for energy efficiency grows, multicore processors have been widely used in real-time systems. It is well known that the dynamic voltage/ frequency scaling (DVFS) [2- 6] is one of the most promising solutions to optimizing the energy consumption, which scales the operating voltage and clock frequency of processors together to meet dynamic performance demands. Pillai and Shin [2] present a class of real-time DVS (RT-DVS) algorithms, including the static voltage scaling, cycle-conserving RT-DVS and Look-ahead RT-DVS, which modify the OS’s real-time scheduler and task management service to provide energy savings C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 344–357, 2010. © IFIP International Federation for Information Processing 2010

Energy-Efficient Scheduling of Real-Time Periodic Tasks in Multicore Systems

345

while maintaining real-time deadline guarantees. However, in the context of multicore systems, all active cores need to work at the same operating voltage/frequency [7,8]. This characteristic of the multicore system is referred to as frequency synchronization in this work. This implies that when the operation frequency is scaled on one core, the performances of the other cores are intuitively influenced. For instance, if a low speed is selected for a processor core to aggressively reduce the energy consumption, the jobs running on other cores may miss their deadlines if there are heavy workloads mapped onto those processor cores. Recently, Globally Asynchronous, Locally Synchronous (GALS) [9-12] design is being widely investigated as an alternative solution to the totally synchronous voltage/frequency design, because it inherits the advantages of both synchronous and asynchronous design. The GALS technology makes the clock distribution and timing closure more manageable. On the other hand, the GALS can be well adapted to the voltage-frequency island (VFI) technology [12,13]. Multicore systems implemented with the GALS are partitioned into several voltage/frequency domains (VFD). In such systems, each core belongs to a specific VFD, while the active cores within the same VFD must share the same supply voltage and operating frequency. The operating frequency/voltage of each domain can be adjusted independently of other domains so as to meet the dynamic performance demands. Moreover, considering the fact that the processor cores in a VFD-based GALS system can be easily put into power saving state with the clock gating technique, an idle processor core or even a complete domain can be placed into the sleep mode to further reduce both the dynamic and the leakage power [12]. Some energy-aware approaches using DVFS have been presented to schedule real-time tasks on multicore architectures. Yang et al. [8] proved that the energy-efficient scheduling of periodic real-time task set in the multicore context is an NP-hard problem. Seo et al. [14] presented a dynamic repartition algorithm for periodic tasks to balance the system load as well as to reduce the leakage power at run-time. Pepijn et al. [15] tried to reduce the leakage current, supply voltage and clock frequency in an integrated way in order to gain the maximum system energy savings. Dinna et al. [16] proposed a power-aware real-time scheduler for the multicore multithreaded processor, which implements dynamic voltage scaling techniques. But they aimed at the soft real-time applications and supposed that all cores in the system shared a global frequency. Wan [17] addressed an energy-saving scheduling scheme of periodic real-time tasks with the DVFS on the lightly loaded multicore platform. However [17] assumed that the tasks can be divided into multiple independent subtasks and concurrently executed in parallel on multiple cores. It is also presumed there are more processing cores than running tasks and each core could adjust its operating frequency independently. The previous studies usually concentrated on two power management mechanisms. They either assumed that each processor core can adjust its operating frequency independently of the other cores, or presumed the global frequency synchronization, namely, all active cores in the system share the same operating voltage/frequency. In this paper, we will address the problem of the energy-efficient scheduling for periodic real-time tasks in the VFD-based multicore system. The proposed scheduling algorithm strives to balance the slack times on the cores in VFDs so as to scale down the operating frequency and reduce energy consumptions, while meeting timing constraints.

346

X. Wu et al.

Given a set of periodic real-time tasks, we define the hyperperiod as the least common multiple (LCM) of the periods of these tasks. Intuitively, each task releases its invocations at the same relative phases of different hyperperiods. For this reason, based upon the worst-case execution time of the tasks, we off-line simulate the dynamic executions of the tasks in the interval of one hyperperiod, then off-line calculate and record the start times as well as the executing speeds of the invocations of each task. At run-time, for each invocation of a task, the scheduler only needs to look up its scheduling information from a pre-determined table. As a result, the on-line scheduling overhead can be significantly reduced. For uniprocessor systems, the static voltage scaling mechanism is proposed in [2] to select the lowest possible operating frequency that will allow the scheduler to meet all the deadlines for the given task set. This static policy set frequency statically, and the frequency will not be changed unless the task set is changed. We will extend the static voltage scaling policy to the multicore platform (called SimpleVS). The SimpleVS algorithm selects the maximum utilization among all cores in a VFD as the shared operating frequency, such that the frequency synchronization of the VFD, the deadline constraints of all tasks and the energy savings can be met simultaneously. Based on the SimpleVS schedule, we propose a slack reallocation algorithm to further reclaim the slack times to conserve energy. This policy adjusts the slack times on each core and divides the schedule into segments. Then, for each segment, it seeks to balance the slack times on cores within a VFD via appropriate job migrations. Hence, almost the same amount of slack time will be allocated to each core in a VFD, thereby a lower operating frequency can be selected for this VFD so as to save energy. We evaluated the performance of our proposed algorithm through extensive simulation experiment under different situations. The results show that compared with the static voltage scaling schedule (SimpleVS), the proposed slack reallocation policy can gain the energy savings by about 22% under low-loaded conditions. The rest of this paper is organized as follows: Section 2 describes the system models, including the energy consumption and the task set. Section 3 introduces the static SimpleVS and gives motivational examples to illustrate the target problem. The proposed energy-efficient scheduling algorithm and the experimental results are discussed in detail in Section 4 and Section 5, respectively. Section 6 concludes this paper and offers our future work.

2 Research Model 2.1 System Model In this research, we consider the multicore system with voltage-scalable processor elements for our study. There are cn identical cores in the system: C = {c1 , c 2 ..., c cn } . Furthermore, the GALS and VFD are implemented in our target platform. All those cores are partitioned into dn VFDs: D = {d1 , d 2 ..., d dn } . The i th domain d i contains dn

dcni processor cores, and then we have: cn = ∑ dcn i . i =1

In the target system, each core has two states: sleep and active. If all tasks on a core finish their executions, this core can be placed into the sleep state for the sake of energy

Energy-Efficient Scheduling of Real-Time Periodic Tasks in Multicore Systems

347

reduction. Further, a VFD is considered to be put in the sleep mode if all cores on this VFI are in the sleep state. A VFD can control its supply voltage and operating frequency independently of other VFDs, while all active cores in the same VFD must share the same operating frequency at any given time. Here, the maximum operating frequency of a core is normalized to 1. 2.2 Energy and Power Model

As for the CMOS-based processor, the total power consumption consists of dynamic power Pdynamic and leakage power Pleakage . The detailed energy and power model used in this paper can be found in [14]. For some very large scale integrated systems, Pleakage is also a critical issue because the off-state current increases about five times per generation [18]. However, the dynamic power still dominates the overall energy consumed by a processor core [19]. Therefore, we only target reducing the dynamic energy consumptions in this paper. For simplicity, we assume that the overheads of voltage/frequency transition and job migration are negligible in this paper. However, these overheads can be easily incorporated into the processing time of a task, if necessary. 2.3 Task Model

The task set to be scheduled is composed of n periodic tasks: T = {t1 , t 2 ,..., t n } . Each task ti is associated with a tuple { pi , wi } , where pi denotes the period of task ti and wi the worst-case execution time (WCET) of ti at the maximum processing speed. Here, the period of any task t i is assumed to be the same as the relative deadline of this task. Then, the load of task t i , tli , is defined as wi / pi . Each invocation of the task is called a job and the k th invocation of task ti is denoted by jik . Furthermore, every task starts to release jobs at time 0. If the number of tasks mapped onto ci is ctni , then cn

we can obtain: n = ∑ ctni . i =1

We use ℜ(ti ) to indicate the core onto which task ti is mapped. It is noteworthy that all jobs invoked by any task ti must execute on core ℜ(ti ) except for job migrations. Then, the utilization of core ci , cui , can be defined as: cu i =

∑ tl k .

ℜ (t k )= c i

Thus, the average utilization acui of a VFD d i is: acui = ∑ cu k dcni , where dcni is c k ∈d i

the number of cores in the VFD d i . It is assumed that there are no interdependencies among the tasks, that is, no common resources are shared by these tasks and they have no precedence relationships. The tasks are scheduled by the EDF in a preemptive way on any processor core. Due to the repeatable job release of a task, we only need to concentrate on the schedule within the interval of a hyperperiod. In addition, the schedule is calculated off-line once and for all such that the on-line scheduling cost can be reduced.

348

X. Wu et al.

Since data transfer across different VFD will result in more energy consumption as well as a high overhead, the job migration is only permitted within the same VFD by our approach.

3 Motivational Examples In this section we give the motivational examples and introduce the SimpleVS that extends the static voltage scaling to the multicore platform. For the target multicore system, each VFD can control its supply voltage and operating frequency independently of other VFDs. Further, as long as the schedule of each VFD is carried out, the global schedule result can be easily obtained. Hence, for simplicity, we will focus on the energy-efficient schedule within one VFD. Consider scheduling six real-time periodic tasks in a VFD with three cores. The parameters of the tasks are listed in Table 1. It is easy to see that the hyperperiod of this task set is 12. Since we mainly concern with the schedule during a hyperperiod, the tasks are scheduled within the interval [0, 12]. Table 1. Parameters of the task set Task ID 1 2 3 4 5 6

WCET 5 1 1 1 1 1

Period 12 3 4 6 6 6

Utilization 5/12 1/ 3 1/ 4 1/ 6 1/ 6 1/ 6

Table 2. Task-core mapping Core ID 1 2 3

Task ID 1, 6 2, 5 3, 4

Utilization 7/12 1/ 2 5/12

The task-core mapping listed in Table 2 is determined by the Worst-Fit Decreasing (WFD) policy (see Subsection 4.2). As can be seen, the heaviest- loaded core among all cores is core 1 and the corresponding utilization of core 1 is 7/12.

(a)

(b)

Fig. 1. (a) Schedule the tasks by EDF algorithm without the voltage scaling. (b) Schedule the tasks with the static SimpleVS algorithm.

Energy-Efficient Scheduling of Real-Time Periodic Tasks in Multicore Systems

349

In Fig. 1, each job is denoted by a block labeled with the job ID and its absolute deadline. For instance, the block with label “ j11 | 12 ” denotes the job j11 with its absolute deadline 12. Fig. 1 (a) shows the EDF schedule of the task set without energy-saving policy, where all cores work at the maximum operating frequency. In contrast, as shown in Fig. 1 (b), SimpleVS algorithm chooses the maximum utilization among all cores as the shared operating frequency of this VFD so as to conserve energy, while ensuring the schedulability of the task set. Consequently, due to the frequency synchronization, the operating frequencies of all cores in the VFD are scaled down to 7/12. Then, the slack on core 1 is fully utilized. In Fig. 1 (b), 73.6% dynamic energy can be saved by the SimpleVS scheduling versus the NonVS scheduling as shown in Fig. 1 (a). However, we can observe from Fig. 1(b) that, there are still some slack times unused on the cores with light utilizations (e.g., core 2 and 3) because of the frequency synchronization. As a matter of fact, these slack times can be utilized to further reduce the energy consumption. To this end, we propose the slack reclamation approach.

4 Hyperperiod-Based Multicore Voltage Scaling Schedule This section describes in detail our proposed hyperperiod-based multicore energy-efficient voltage scaling scheduling algorithm (called as HMVS). We first present the high-level description of our approach. 4.1 The High Level Description

The proposed approach is based upon an EDF schedule. Then, the slack times are reclaimed and utilized by two strategies, i.e., SimpleVS and slack reallocation, to conserve energy. The high level description of the proposed HMVS is shown in Algorithm 1. First, all tasks are mapped onto the cores by the heuristic of WFD in line 1 (see Subsection 4.2). Then, we use two policies to reclaim the slack times for the energy savings on a per-VFD basis. Finally, the execution speed is calculated and assigned to each job.

In line 3, the tasks are scheduled by the NonVS (the EDF without DVFS) algorithm on each core. Next, the SimpleVS policy tries to utilize the slack times by selecting the maximum utilization among all cores as the shared operating frequency of this VFD. However, the majority of the slack times are usually generated closely to the end of a

350

X. Wu et al.

hyperperiod, which prevents the slack reallocation from effectively exploring these slack times. For this reason, line 5 tries to move the slack times forward. At the same time, the start times of some jobs are postponed. Fig. 2 (a) (see Subsection 4.4) shows the scenario after the slack shifts have been applied. Then, the slack reallocation in line 6 strives to redistribute the slack times that are not reclaimed by the SimpleVS uniformly to the cores by appropriate job migrations. 4.2 Task-Core Mapping Heuristic

Many heuristics have been introduced to map the tasks onto processors, such as Best-Fit Decreasing (BFD), First-Fit Decreasing (FFD), Next-Fit Decreasing (NFD), Worst-Fit Decreasing (WFD), etc. Aydin and Yang [20] pointed out that balancing the utilizations of the processing elements facilitates maximizing the energy savings. They further claimed that the WFD enables to generate a better balanced partition than other well-known heuristics. Therefore, we choose the WFD as the policy of task-core assignment for our study. Before the tasks are assigned to the processor cores, they are sorted in a non-increasing order of their loads (i.e., tl ). Next, these ordered tasks are partitioned into cn subsets by the WFD. Then, we allocate these subsets of tasks to the cn processor cores respectively. Consequently, a quasi-balanced task-core mapping can be obtained. 4.3 Moving Slack Times Forward

Observe that the slack times are usually distributed closely to the end of the hyperperiod, which prevents efficient slack utilizing. Hence, we adjust the slack times in the original schedule, while maintaining the schedulability of these tasks. As mentioned before, we off-line simulate the dynamic executions of the tasks, and record the necessary information of jobs on each core. Assume that the schedule determined by the SimpleVS on a core ci is represented by tsn[i ] continuous time slices, ci .Sched = {ts1 , ts 2 ,..., tstsn[i ] } . These time slices are sorted in an increasing order of their start times. The time interval of a slice ts is denoted by [ts.starttime, ts.endtime] . Since the preemptive scheduling is applied, a job may execute in several time slices because of the job preemptions. A time slice ts has a predecessor ts. previous and a successor ts.next except that it is the first or the last time slice of the schedule. Then we have: ts.starttime = ts. previous.endtime, ts.endtime = ts.next.starttime . Each time slice has two states: active or idle . If there exists one job executing during the time slice ts , then we say this time slice is active: ts.state = active . Further, this job and its absolute deadline are represented by ts. job and ts. job.deadline , respectively. Otherwise, this time slice is in the idle status in the sense that it is a slack: ts.state = idle . Algorithm 2 describes how our proposed slack-shifting policy adjusts the schedule on a core. The slack adjustment algorithm tries to postpone the start times of job executions during the active time slices, while the idle time slices (i.e., slack times) are moved “forward” at the same time. The shifting operation is performed from the last time slice

Energy-Efficient Scheduling of Real-Time Periodic Tasks in Multicore Systems

351

to the first one on each core. From line 4 to line 8, the algorithm shifts the current time slice ts backward when the following three conditions are satisfied simultaneously: 1) the current time slice is active, 2) the next one is idle, and 3) the absolute deadline of the current-running job is greater than the completion time of the current time slice. The algorithm delays the active time slice ts while ensuring that it will complete execution before ts. job.deadline (line 5). Thus, it can be inferred that this adjustment does not violate the schedulability of any task. Fig. 2 (a) illustrates the scenario after the slack adjustment has been applied.

4.4 Slack Reallocation

The slack reallocation is adopted after the slack adjustment. First, the whole hyperperiod is divided into segments for each VFD prior to the slack reallocation. Then the slack reallocation is performed on a per-segment basis. This policy attempts to redistribute the available slack times to the cores by job migrations so that a lower operating frequency of a VFD can be obtained. We divide the hyperperiod into sn consecutive and non-overlapped segments in a VFD: SG = {sg1 , sg 2 ,...sg sn } . The segment sg i starts from sg i .starttime and ends at

sgi .endtime . Suppose that SG is determined by sn + 1 boundary points: B = {b0 , b1 ,..., bsn } . The start and end time of the segment sg i are denoted by bi−1 and bi ( 1 ≤ i ≤ sn ), respectively. The boundary points are defined as following: b0 = 0, bsn = hyperperiod ,

(1)

bi = min{ts.starttime | ts ∈ ∪ ci .Sched ∧ ts.starttime > bi −1 ∧ ts.stae = idle},0 < i < sn,

(2)

dcn i =1

where dcn is the number of cores contained in the VFD, and ci .Sched is defined in Section 4.3. Thus, SG can be determined by (1) and (2). The reason the segments are defined in such a way is that when a job executes, only the available slack prior to its execution is allowed to be used by this job to lower its executing speed. Therefore, the schedulability of the task set can be guaranteed.

352

X. Wu et al.

(a)

(b)

Fig. 2. (a) Adjust the slack times by shifting them forward. (b) Reallocate the slack times and scale down the operation frequency in [0, 4].

After the segments have been determined, some time slices may be separated by these segments as well (e.g., j51 on core 2). We re-partition these time slices such that each time slice only belongs to one segment. As shown in Fig. 2 (b), the interval of the hyperperiod [0, 12] is divided into 4 segments by the dashed lines. In multicore systems, all cores usually share the last-level cache and main memory. This implies the cost of context switching caused by the migration of a job from a core to another is relative low. However, we constrain frequent job migrations to reduce the on-line overhead. Hence, the degree of un- balancedness ub for a segment sg in a VFD is introduced to help improve the trade-off between the overhead and the energy-saving for each segment:

ub =

max{slack i | 1 ≤ i ≤ dcn} − min{slack i | 1 ≤ i ≤ dcn} × 100%, max{slack i | 1 ≤ i ≤ dcn}

(3)

where slacki is the available amount of slack times on ci during sg , and dcn is the number of cores contained in the VFD. The job migration is applied only when the degree of unbalancedness is greater than a given threshold value. Further, the number of migrations for each job is constrained by an upper bound mb during its execution in order to reduce the overhead caused by the job migrations. Hence, the trade-off between the energy-saving and the overhead can be flexibly controlled. In this work, the threshold of the degree of unbalancedness ub and the upper bound of migrations for a job mb are set to 10% and 3, respectively. Algorithm 3 illustrates the slack reallocation within a VFD d i . The slack reallocation is performed on a per-segment basis (line 1). The amount of slack on each core in VFD d i and the ub within the current segment sg are calculated in line 2 and line 3, respectively. Then, the total amount of slack times in VFD d i is summed and the amount of slack that can be ideally distributed to each core is calculated in line 4. From line 5 to line 11, the algorithm strives to balance the slack times on the cores in d i within the current segment sg until ub is less than the given threshold 10%. If the current core c j has more slack than the ideal slack, then line 7 selects the core cmin with the

Energy-Efficient Scheduling of Real-Time Periodic Tasks in Multicore Systems

353

minimum slack. Next, line 8 tries to perform the job migrations from cmin to c j so as to reduce the slack on c j to the ideal slack. Thus, the slack times distributed on all cores can be more evenly.

It is noteworthy that the job migrations are on the basis of maintaining the schedulability. Further, unlike [17], we assume that one job is disallowed to execute on different cores at the same time (i.e., serial-run-rule). As a result, after the slack reallocation is processed, the slack times are not necessarily distributed absolutely uniformly on the cores in the system.

5 Experimental Results In this section, the simulation experiments are used to evaluate the performance of the proposed scheduling approach with respect to energy consumption. In this experiment, the simulator scheduled the randomly generated tasks and recorded the energy consumptions within a hyperperiod under different scheduling policies, i.e., NonVS, SimpleVS, and the proposed scheduling with slack reallocation (HMVS). Usually, the current CMOS- based systems provide discrete operating voltage/frequency. However, we still evaluate the performance of our algorithm under continuous voltage/ frequency levels as well as discrete voltage/frequency levels in this experiment. The parameters ub and mb are set to 10% and 3, respectively. Namely, the slack reallocation and job migrations are performed only when the degree of unbalancedness (defined in Subsection 4.4) is greater than 10% and one job cannot be migrated more than 3 times. Recall that the maximum frequency of any processor core is normalized to 1. In the case of discrete frequency levels, seven different frequency levels were assumed to be supported by the target system: {0.36,0.55,0.64,0.73, 0.82,0.91,1.0} . In Fig. 3(a), the normalized energy consumed by the SimpleVS and the NonVS are compared. It can be seen that the SimpleVS achieved significant energy savings against the NonVS. Furthermore, more energy-saving was achieved by the scheduling with the

354

X. Wu et al.

continuous operating frequency selection, while the performance improvement decreases with increasing the average core utilization acu . In the case of the continuous frequency scaling, more than 99% dynamic energy was saved by the SimpleVS when acu is between 0 and 0.1. Whereas only less than 7% energy could be saved when acu was in the range of 0.9-1.0. In contrast, the energy consumptions saved by the SimpleVS with the discrete frequency scaling are 88% and 0, respectively. NonVS

SimpleVS (Continuous Fre)

SimpleVS (Discrete Fre)

HMVS(Continuous Fre)

1.2

1.05

1

1 gyr en0.95 E de 0.9 izl am ro0.85 N

y g r e 0.8 n E d e z 0.6 i l a m r 0.4 o N 0.2

0.8

0

0.75

0.05

0.15

0.25

0.35 0.45 0.55 0.65 Average Core Utilization

(a)

0.75

0.85

0.95

2

4

HMVS(Discrete Fre)

6 8 10 Core Number Per Domain

12

16

(b)

Fig. 3. (a) Normalized energy consumption of NonVS and SimpleVS schedule. (b) Normalized energy consumption of HMVS with continuous and discrete frequency selection, acu ∈ (0.2,0.4) .

Fig. 3(b) shows the energy consumption of the proposed scheduling algorithm with the continuous and the discrete frequency selection. The average utilization acu was in the range [0.2,0.4] , and the energy consumptions of their corresponding SimpleVS schedules are normalized to 1. In contrast to the discrete frequency selection, the HMVS algorithm gains about 5% to 11% more energy saving with the continuous frequency selection. This is because that the later is more flexible in selecting the operating frequency and can utilize the slack times more efficiently to save energy. As most of the state-of-the-art realistic multicore systems can adjust their operating voltages and executing speeds discretely, we will focus on the experiment based on the discrete frequency selection hereafter. In this case, when our algorithm assigns an operating frequency to a processor core, the nearest discrete frequency no less than the operating frequency which is calculated under the continuous voltage/frequency levels is selected. The impact of various VFD granularities on the energy consumption is depicted in Fig. 4. In general, the performance improvement brought by the HMVS versus the SimpleVS begins to decrease with a coarser granularity of a VFD. The reason is that, it is usually easier for HMVS to reallocate the slack times evenly to the cores in a fine-grained VFD than in a coarse-grained VFD. Hence, multicore systems with fine-grained VFDs can help improve the effectiveness of the slack utilization and can lead to more energy savings. We can find from Fig. 4 (a) that, when acu falls into the range of [0.3,0.4] , 22% energy was saved by HMVS in a system with 4 cores in a VFD, while only 1% energy could be saved when a VFD contains 16 cores in the target system.

Energy-Efficient Scheduling of Real-Time Periodic Tasks in Multicore Systems

SimpleVS

SimpleVS

HMVS

HMVS

1.05

1.05 1

1

y g0.95 r e n E 0.9 d e z i0.85 l a m r o 0.8 N

yg re0.95 nE de 0.9 zi la0.85 mr No 0.8

0.75

0.75

0.7

355

2

4

6 8 10 Core Number Per Domain

12

16

0.7

2

4

6 8 10 Core Number Per Domain

(a)

12

16

(b)

Fig. 4. Normalized energy consumption under different VFD granularity (a) The average core utilization acu ∈ [0.3,0.4] . (b) The average core utilization acu ∈ [0.6,0.7] .

Moreover, by the comparison between Fig. 4(a) and (b) we can observe the influence of acu upon the energy saving. In principle, the energy consumption increases with an increasing acu . This is because that there are more available slack times on the core with a lower utilization than those on the core with higher utilization, which can be used to reduce the energy consumption. As far as the system in which each VFD contains 4 cores is concerned, the HMVS saved 22% energy compared with the SimpleVS when the acu is between 0.3 and 0.4. On the contrary, only 4.7% energy could be saved when the acu falls into the range [0.6,0.7] . Fig. 5 presents the energy consumption of the proposed algorithm with various acu . Note that the energy consumption of the two scheduling policies are the same when acu is smaller than 0.35. The reason is that the minimum discrete executing frequency of the active core is 0.36 in the experiment. The operating frequency could not be decreased any more even if the utilizations of all cores are quite low (i.e., less than 0.36). Therefore, the discrete frequency levels will affect the energy saving efficiency to a certain extent. As Fig. 5 (a) shows, our algorithm achieved better energy conservations when acu is between 0.3 and 0.4. SimpleVS

HMVS

SimpleVS

1.05

HMVS

1.05

1

1

y gr 0.95 en E 0.9 de zi l 0.85 mar oN 0.8

y g r 0.95 e n E

d e 0.9 z i l a m r 0.85 o N

0.75

0.8

0.7

0.75

0-0.3

0.35

0.45

0.55

0.65

Average Core Load

(a)

0.75

0.85

0.95

0-0.3

0.35

0.45

0.55

0.65

0.75

0.85

0.95

Average Core Load

(b)

Fig. 5. Normalized energy consumption under different core utilizations. (a) Each voltage/ frequency domain contains 4 cores. (b) Each voltage-frequency domain contains 16 cores.

356

X. Wu et al.

It can also been observed by comparing Fig. 5(a) and (b) that the system with finer grained VFDs can achieve more energy saving. When the acu falls into the range [0.3,0.4] , compared with NonVS schedule, the HMVS algorithm can save energy by 22% and 1% if each VFD contains 4 and 16 cores, respectively.

6 Conclusions and Future Work In this paper, the problem of energy-efficient scheduling for real-time periodic tasks on multicore systems is studied. The static voltage/frequency scaling of the multicore system (SimpleVS) is firstly introduced. SimpleVS selects the utilization of the heaviest-loaded core in a VFD as the operating frequency of each core in this VFD. Next, based upon the SimpleVS, the slack reallocation is proposed to take use of slack times under the constraint of the frequency synchronization and real-time schedulability. The slack reallocation tries to uniformly redistribute the slack times to the cores on the same VFD so that the synchronous executing frequency can be lowered. Consequently, energy savings and timeliness requirements can be satisfied concurrently. The experimental results show that, as far as multicore systems with discrete frequency selection is concerned, the proposed policy can reduce the dynamic energy by up to 22% versus the static policy when the system is under-utilized. It should be noted that as the proposed scheduling algorithm is based on the worst case executing time (WCET) of tasks and the reclamation of static slack times. However, there are usually dynamic slack times at run-time caused by the difference between the WCET and the actual executing time (AET). We will target studying the on-line schedule dealing with the dynamic slack reclamation.

Acknowledgments This work is supported in part by the NSF of China under grants No.60503048, by China Post-Doc Foundation (20070410280), by State Key Lab for Novel Software Technology, Nanjing University (kfkt2009b13), and by the US National Science Foundation under grant number CCF-0541403.

References 1. Multi-Core Processors—The Next Evolution in Computing, white paper, Advanced Micro Devices, Inc., http://www.amd.com/us-en/assets/content_type/white_ papers_and_tech_docs/IDC_Multi-Core_64-bit_White_Paper.pdf 2. Pillai, P., Shin, K.G.: Real-Time Dynamic Voltage Scaling for Low-Power Embedded Operating Systems. In: Proc. 18th ACM Symp. Operating Systems (SOSP 2001), pp. 89–102 (2001) 3. Lee, J., Kim, N.: Optimizing total power of many-core processors considering voltage scaling limit and process variations. In: Proceedings of the 14th ACM/IEEE Int’l Symp. Lower-Power Electronics and Design (2009)

Energy-Efficient Scheduling of Real-Time Periodic Tasks in Multicore Systems

357

4. Yang, C.Y., Chen, J.J., Kuo, T.W.: Energy-efficiency for multiframe real-time tasks on a dynamic voltage scaling processor. In: Int’l Conf. on Hardware Software Codesign (2009) 5. Han, J.J., Li, Q.H.: Dynamic Power-Aware Scheduling Algorithms for Real-Time Task Sets with Fault-Tolerance in Parallel and Distributed Computing Environment. In: 19th IEEE International Parallel and Distributed Processing Symposium (2005) 6. Zhuo, J., Chakrabarti, C.: Energy-efficient Dynamic Task Scheduling Algorithms for DVS Systems. ACM Trans. Embedded Computing Systems 7(2) (2008) 7. Magklis, G., Semeraro, G., Albonesi, D.H., Dropsho, S.G., et al.: Dynamic Frequency and Voltage/frequency scaling for a Multiple-Clock- Domain Microprocessor. IEEE Micro. 23(6), 62–68 (2006) 8. Yang, C., Chen, J., Luo, T.: An Approximation Algorithm for Energy-Efficient Scheduling on a Chip Multiprocessor. In: Proc. Design, Automation and Test in Europe Conf. and Exhibition (DATE 2005), pp. 468–473 (2005) 9. Iyer, A., Marculescu, D.: Power and Performance Evaluation of Globally Asynchronous Locally Synchronous Processors. In: ISCA, pp. 652–661 (2002) 10. Semeraro, G., Magklis, G., Balasubramonian, R., et al.: Energy-Efficient Processor Design Using Multiple Clock Domains with Dynamic Voltage and Frequency Scaling. In: Proceedings of the 8th International Symposium on High-Performance Computer Architecture (ISHPC), pp. 29–40 (2002) 11. Semeraro, G.P., Albonesi, D.H., Magklis, G., et al.: Hiding Synchronization Delays in GALS Processor Microarchitecture. In: ASYNC 2004, pp. 159–169 (2004) 12. Niyogi, K., Marculescu, D.: Speed and voltage selection for GALS systems based on voltage/frequency islands. In: Proc. ASP-Des. Autom. Conf., January 2005, pp. 292–297 (2005) 13. Marculescu, D., Talpes, E.: Variability and energy awareness: A micro- architecture-level perspective. In: Proc. Des. Autom. Conf., June 2005, pp. 11–16 (2005) 14. Seo, E., Jeong, J., Park, S., Lee, J.: Energy efficient Scheduling of Real-Time Tasks on Multicore Processors. IEEE Trans. Parallel and Distributed Systems 19(11), 1540–1552 (2008) 15. Langen, P.D., Juurlink, B.: Leakage-Aware Multiprocessor Scheduling for Low Power. In: Proc. 19th IEEE International Parallel and Distributed Processing Symposium (2006) 16. Bautista, D., Sahuquillo, J., Hassan, H., et al.: A simple power-aware scheduling for multicore systems when running real-time applications. In: Proc. IEEE International Parallel and Distributed Processing Symposium (2008) 17. Lee, W.Y.: Energy-Saving DVFS Scheduling of Multiple Periodic Real-Time Tasks on Multi-core Processors. In: Proc. IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, pp. 216–223 (2009) 18. Borkar, S.: Design challenges of technology scaling. IEEE Micro 19(4), 23–29 (1999) 19. Sengupta, D., Saleh, R.A.: Application-Driven Voltage-Island Partitioning for Low-Power System-on-Chip. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 28(3), 316–326 (2009) 20. Aydin, H., Yang, Q.: Energy-Aware Partitioning for Multiprocessor Real-Time Systems. In: Proc. Int’l Parallel and Distributed Processing Symp. (2003)

The Core Degree Based Tag Reduction on Chip Multiprocessor to Balance Energy Saving and Performance Overhead Long Zheng1,2 , Mianxiong Dong2,3 , Hai Jin1, , Minyi Guo4 , Song Guo2 , and Xuping Tu1 1

Huazhong University of Science and Technology, Wuhan, 430074, China 2 University of Aizu, Aizu-Wakamatsu, 965-8580, Japan 3 University of Waterloo, N2L, 3G1, Canada 4 Shanghai Jiao Tong University, Shanghai, 200240, China [email protected]

Abstract. Tag reduction is an approach to save energy of the cache system in a processor. Our previous work described that it can save more energy on a Chip Multiprocessor (CMP) than on a single-core processor. In this paper, we further investigate the problem on balancing energy saving and performance overhead when tag reduction is used for the low power Chip Multiprocessor (CMP). We ﬁrst introduce the core degree concept which is deﬁned as the number of cores that tag reduction can use for each thread. We then propose a core degree based tag approach that is to optimize the core degree such that the best balance of energy and performance can be achieved. In particular, as the basis for such optimization, the theoretical upper bounds of the energy savings and performance overhead are decided as function of the core degree. In our experiments, we use a 16-core CMP for example. In order to obtain the energy consumption and performance overhead with various core degrees, we construct an experimental environment, which is based on the Linux operating system. With the experimental environment, benchmarks of SPEC CPU2006 are used to evaluate our core degree based tag reduction. Finally, the experimental results show that the most desired balance of energy saving and performance overhead is achieved when core degree is set to 6.

1

Introduction

Over the last decade, manufactures competed to advance the performance of processors by raising the clock frequency. However, the dramatically increased power consumption and thermal problem caused by high clock frequency have ended this race. Now duplicating a number of cores in a chip, called Chip Multiprocessor (CMP) [1], as multi-core processor, is considered as an easier and more eﬃcient way to enhance the performance of microprocessors. Multi-core architecture is quickly becoming the mainstream that can achieve higher computational

Corresponding author.

C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 358–372, 2010. c IFIP International Federation for Information Processing 2010

The Core Degree Based Tag Reduction on CMP

359

capacity. Intel has acclaimed that more than 64 cores will be integrated into one processor in the near future [2]. Nvidia has presented its many-core solutionGPGPU [3]. Moreover the multi-core architecture that is applied to embedded system and mobile computing has emerged recently [4]. It is well known that cache system is one of the most power consuming component in a processor, as it is implemented as SRAM to get a fast clock frequency. Taking two widely-used commercial processors as example, the cache system in Alpha 21164 uses up to 25% of the total energy consumed by the processor [5], and even higher in StrongARM-110 with the Instruction Cache (I-Cache) and Data Cache (D-Cache) of up to 26% and 17% [6], respectively. The less power dissipation means to relieve the thermal problem and prolong the battery lifetime in hand-held devices. Whereas, saving energy usually leads to performance degradation, so the energy and performance are always two major issues that should be well balanced in design and optimization of microprocessors. This paper exploits tag reduction on Level 1 I-Cache (L1 I-Cache) to achieve energy and performance balance. We propose a method to extend tag reduction on CMP, meanwhile, introduce the core degree as the constraint to balance the energy and performance. The main idea of this paper is to distribute instruction page frames to several cores with the core degree constraint, such that the tagreduction conﬂicts can be reduced or even avoided. This approach can improve the capacity of energy saving by tag reduction compared to the one on previous single-core processor. However, it causes the extra context switches as the overhead to assign instruction pages to multiple cores. The core degree constraint controls the overhead at all acceptable level by restricting of the number of cores that the instruction page frames are allowed to. By varying the values of core degree, we can ﬁnd out the most appropriate value of core degree that achieves the best balance energy saving and performance overhead from our experiment. It is diﬃcult to get comprehensive data to measure the energy saving and performance overhead directly. Instead, we develop a model that can evaluate both energy saving and performance overhead of tag reduction quantitatively. Besides, since the results of tag reduction relies on the real distribution of instruction page frames, it is crucial to collect the distribution information of instruction page frames residing in physical memory, which we consider carefully in our experiments. The remainder of this paper is structured as follows. Section 2 brieﬂy presents the background of tag reduction and the related work. Section 3 propose the methodology that extends the tag reduction on CMP with energy saving and performance overhead, and presents a core degree based heuristic algorithm. Section 4 develops the model to evaluate the balance of energy and performance. Section 5 gives the experimental methods and results, as well as the analysis. Section 6 concludes our work and points out our ﬁndings.

2

Background and Related Work

In this section, we ﬁrst review the basic concept of tag and tag reduction very brieﬂy. Other related techniques on energy savings are also discussed. With a

360

L. Zheng et al.

Virtual-Indexed Physically-Tagged (VIPT) cache system of a processor, when the processor accesses L1 I-Cache, the 32-bit virtual address is divided into two parts. The most signiﬁcant 20 bits constitute the Virtual Page Number (VPN) and the remainder is the page oﬀset which is further divided into two 6-bit pars. The lower 6 bits are sent to processor directly, while the higher 6 bits are used as index to access the corresponding entry of the L1 I-Cache. Meanwhile, VPN is translated into Physical Page Number (PPN) through the TLB. Each entry of L1 I-Cache is composed by a tag and instructions. The tag stores the PPN of these instructions. Only if the tag in the entry is equal to the PPN translated by the TLB, called as L1 I-Cache hit, the instructions in the entry indexed by the most signiﬁcant 6 bits of page oﬀset are conﬁrmed to be what the processor really needs to access. Otherwise, the L1 I-Cache miss leads to the secondary access to L2 I-Cache. In our paper, with the function of tag described above, shortening the length of tag can reduce the energy consumption of L1 I-Cache. Tag reduction is a way to save energy consumed by L1 Cache and the Table Lookaside Buﬀer (TLB) in a processor [7] [8] [9] [10]. It has received much attention in the literature because it can save energy of caches signiﬁcantly. Tag reduction has been applied to general or customizable embedded processor [7] [9], to the TLB [8], as well as to the processor using heterogeneously tagged caches [10]. Most research eﬀects of tag reduction focus on its utilization on a single-core processor. Our previous work [11] is the ﬁrst one that implements the tag reduction to the CMP by proposing three heuristic algorithms. However, it does not consider the performance issue, which motives us to further exploit tag reduction on CMP and to analyze the balance of energy saving and performance overhead.

3 3.1

Methodology The Basic Idea of Tag Reduction on CMP

Tag reduction is an eﬀective way to save energy of L1 cache system for two main reasons. One is that in order to achieve a fast clock, tag is implemented as the energy-consuming SRAM. The tag hardware is an array that includes several entries. All bits in a tag entry reside in a wordline. Each time when the tag entry is accessed, all bitlines are charged ﬁrst and then discharged, resulting in a signiﬁcant amount of energy consumption [7]. On the other hand, when Physical Address Extension (PAE) is used or not, a modern 32-bit processor can support up to 4GB or 64GB physical memory, respectively. However, it is a kind of waste to use 20-bit tag to distinguish the entries that have the same page oﬀset, because OS does not always exhaust all the physical memory. Therefore, according to the organization of tag hardware, we can power oﬀ some unnecessary bits by reducing the length of a tag needed to save energy. All cores in CMP shares Level 2 or Level 3 caches, but L1 caches are always private for each core. Therefore, each core only needs to distinguish the page frames that are assigned to itself. We ﬁrst consider the ideal case that no

The Core Degree Based Tag Reduction on CMP

361

tag-reduction conﬂicts will occur, when the tag reduction is applied to CMP. i Under such considerations, tag size Bcmp of each core i is ⎧ i ⎨ Bcmp = log2 Ni w (1) Ni ⎩N = i=0

where Ni is the number of page frames assigned to core i, N is the number of all page frames in physical memory and w is the number of the cores. On the other hand, Equation (1) shows the best scenario is that all page frames are consecutively distributed in physical memory. When the page frames are not consecutive, the tag-reduction conﬂict exists [11]. Diﬀerent from the tag reduction on a single-core processor, a well designed page frame assignment strategy on CMP can minimize or even avoid the tag-reduction conﬂicts. For example, we can separate the page frames, which would cause conﬂict, from each other by assigning them to diﬀerent cores. The opportunities that allow tag reduction on CMP to decrease conﬂicts can achieve an even higher energy saving than the case on the single-core processor. 3.2

The Core Degree

For the notions of thread and process, OS does not always distinguish them all the time. Especially in process scheduling, OS treats threads and processes the same way. Since these diﬀerences do not aﬀect the discussion of this paper, from now we do not distinguish a thread and a process explicitly, and we use the same word thread to represent them, except for some special cases when we need to distinguish them. There are multiple cores in a CMP, therefore two or more threads can run parallel on diﬀerent cores which is Thread Level Parallelism (TLP). This is how CMP gains performance improvement. Nevertheless, TLP on CMP would be much complicated. In order to decrease cache missing and achieve further optimization of each core, CMP with the support of operating system enables an aﬃliation mechanism that dedicates each thread running on a particular core, just like that each thread aﬃliates to a particular core. Since Windows NT 4.0 and Linux 2.4 supported Shared Memory Multiprocessor (SMP), implying that they also support CMP, the aﬃliation mechanism has been implemented. Because of aﬃliation mechanism, in the real CMP computer system, all I-Pages of a thread are assigned to a particular core, and the thread always runs on this core. Based on the analysis in Section 3.1, tag reduction on CMP can exploit more opportunities to gain more energy saving than the one on a single-core processor by assigning page frames to diﬀerent cores. In practice, tag reduction on CMP divides the whole page frames of each thread into several parts to decrease or avoid the tag-reduction conﬂicts. The core degree is thus introduced here to represent the number of cores that the page frames of a thread can be assigned to. The minimal core degree is 1 and the maximal is equal to the total number of

362

L. Zheng et al.

cores. The core degree equal to 1 means that all the page frames of each thread are assigned to one particular core. On the contrary, if core degree is equal to the number of cores, it means that page frames of each thread are assigned to all cores. The larger the value of core degree is, the more cores the page frames of each thread can be distributed to. In theory, a larger value of core degree provides opportunities to avoid tag-reduction conﬂicts, and at the same time, allows less number of page frames in average to be assigned to each core decreases. Both are beneﬁcial to the energy savings. Recall that tag reduction needs page frames of each thread to be divided into parts, we consider each part as non-parallel, as the data dependency between parts may exist. Each non-parallel part is designed to be treated as a thread by OS. The number of non-parallel parts depends on the decision of tag reduction on CMP. Because the concurrency of threads still remains, the non-parallel parts of diﬀerent threads can run on diﬀerent cores simultaneously. Besides, these non-parallel parts also follow the aﬃliation mechanism of OS, that is, each nonparallel part is always executed on the core that it is originally distributed by OS. Now, Equation (1) can be reﬁned as ⎧ pi i ⎪ ⎪ Nij ⎨ Bcmp = log2 j=1 (2) pi d ⎪ ⎪ Nij , d ≤ σ ⎩N = i=1 j=1

where Nij is the number of page frames of the j-th parts on core i, pi is the number of parts on core i, d is the number of cores that have the parts and σ is the value of core degree. Although tag reduction on CMP does not break the concurrency of threads, the non-parallel parts lead to extra thread switches. The thread switch takes system time, so the extra thread switches slow down the system. This is the performance overhead induced by tag reduction on CMP. Therefore, as the value of core degree goes up, the extra thread switches increases and then the performance of system goes down. In a word, the core degree used by tag reduction on CMP aﬀects both energy saving and the overhead of performance. When it grows, tag reduction can save more energy of CMP but causes more overhead of performance. The balance of energy saving and performance overhead is determined by core degree. 3.3

Heuristic Algorithm

In this section, we propose a heuristic algorithm to implement the method. The algorithm incorporates the aﬃliation mechanism and applies the tag reduction on CMP under a given core degree. As shown in the analysis above, the eﬀect of tag reduction on CMP depends on the number of page frames of each core and the chance of tag-reduction conﬂicts. Since the tag-reduction conﬂict is always more crucial than the other, we consider it as the priority factor to design this heuristic algorithm. That is, we try to decrease and avoid the tag-reduction conﬂict as much as possible in this algorithm.

The Core Degree Based Tag Reduction on CMP

363

For loading instruction pages of Thread M, the heuristic algorithm can be described concisely as follows. (1) Load a page of instructions that is part of Thread M into an I-Page frame with page frame number N. (2) If the number of cores that the I-Page frames of Thread M have been assigned to is less than core degree, then go to step 3; otherwise go to step 4. (3) In all cores, ﬁnd out the one with the shortest reduced tag if the N-th I-Page frame is included; then go to Step 5. (4) Within the cores that Thread M has already been assigned to, ﬁnd out the one with the shortest reduced tag if the N-th I-Page frame is included. (5) Dispatch the I-Page frame to the chosen core and make tag reduction. (6) If more pages of instructions of Thread M need to be loaded into physical memory, return to Step 1; otherwise terminate.

4

Energy and Performance Analysis

In Section 3, we propose a tag reduction technique for a given core degree on CMP and discuss the corresponding energy saving and performance overhead issues conceptually. In this section, we develop an analytical model to quantitatively evaluate both energy consumption and performance overhead, as well as to study their balance. 4.1

Energy Consumption

Recall that accessing a tag once undergoes a cycle of charging and discharging as described in Section 3.1. We assume the energy consumption of one such cycle for one bit tag to be e Joules. When programs run on a processor, some instructions are jumped over and, on the contrary, some are executed repeatedly. However, the instructions are executed by a processor like an instruction ﬂow from view of the processor. Each instruction fetching from cache and memory hierarchy needs to access L1 I-Cache tag. Because the execution of diﬀerent instructions takes diﬀerent time, the number of instructions executed in a processor time slice may be not equal. For a CMP with w cores, running t threads, the energy consumption E on the L1 I-Cache tag in a period with s time slices can be calculated as follows. E =e·

aij w s

T agR(

i=1 j=1 n=1

t

AddEx(Iij (m, n))),

(3)

m=1

subject to ∀i, 1 ≤ i ≤ s, ∀m, 1 ≤ m ≤ t, and ∀n, 1 ≤ n ≤ aij σ≥w−

w

φ(Iij (m, n)).

(4)

j=1

The variables and functions in (3) and (4) are detailed as follows. ∀i, 1 ≤ i ≤ s and ∀m, 1 ≤ m ≤ t, we use Iij (m, n) to denote a set that represents thread m’s

364

L. Zheng et al.

instruction ﬂow till the n-th instruction executed by the j-th core during the i-th time slice. Each element records a particular instruction’s physical address and execution time. In the instruction ﬂow, the element of Iij (m, n) is a twodimensional array, just like (A, T ) in which A is the physical address and T is the executed time. The n-th instruction can be expressed as (An , Tn ). Therefore the thread m’s instruction ﬂow till the n-th instruction can be expressed exactly as {(A, T )|(A, T ) ∈ Iij (m), T ≤ Tn }, (5) where Iij (m) represents thread m’s instruction ﬂow executed by the j-th core during the i-th time slice. Function AddEx extracts the A of each (A, T ) of set Iij (m) to compose a new physical address set. Function T agR reduces the length of tag with the physical addess set AddEx(Iij (m)). Furthermore, aij is the total number of instructions executed by the j-th core, during the i-th time slice, i.e.

aij = num(

t

Iij (m))

(6)

m=1

Finally, Equation (4) guarantees that instructions of each thread can be only assigned to at most σ number of cores, due to the core degree constraint. It is formulated by the function φ(S) which returns 1 if set S is an empty set or 0 otherwise. We further deﬁne the average number of bits that a tag requires after tag reduction for the j-th core to execute all the instructions in the i-th time slide, Bij , as follows. aij t 1 Bij = · T agR( AddEx(Iij (m, n))) aij n=1 m=1

(7)

Equation (3) can thus be rewritten as E =e·

s w

(Bij · aij )

(8)

i=1 j=1

Let Ni be the total number of instructions executed by all cores in the i-th time slice, i.e., w Ni = aij (9) j=1

and fij be the ratio of the number of instructions assigned to the j-th core over the total number of instructions to all cores in the i-th time slice, i.e., aij aij fij = = w Ni aij j=1

(10)

The Core Degree Based Tag Reduction on CMP

365

Equation (3) can be eventually expressed as E =e·

s w

(Bij · fij · Ni )

(11)

i=1 j=1

If we deﬁne Nmax ≡ max (Ni ) 1≤i≤s

(12)

the upper bound of energy consumption by tag reduction E can be derived as ⎧ s w ⎪ ⎪ (Bij · fij ) ⎨ E ≤ E = e · Nmax · i=1 j=1 (13) w ⎪ ⎪ φ(Iij (m, n)) ⎩σ ≥ w − j=1

We have obtained the upper bound of energy consumption with tag reduction under the core degree constraints as formulated in (13). The upper bound of energy consumption can be calculated if Bij and fij are known, since both e and Nmax are constant numbers. 4.2

Performance Overheads

To formulate the discussion in Section 3.2, the performance overhead is deﬁned as the number of the additional thread switches induced by tag reduction. We ﬁrst show the total number of switches, ST , on a tag reduction enabled CMP as ST = SP + SR − ε, ε SP, SR

(14)

where SP is the number of thread switches when tag reduction is not used, SR is the number of thread switches caused by the additional non-parallel parts, and ε is the overlap between SP and SR. Generally, the overlap rarely happens, only in some very special cases, for example, when a thread switch caused by additional non-parallel parts should have occurred because of the expiration of time slice of OS. So ε is far less than SP or SR. The performance overhead caused by tag reduction on CMP T is T = SR − ε

(15)

Similar to (3), for a CMP with w cores and t running threads, SR in a period with s time slices can be calculated as t s w Fim − 1 SR = aij Pim j=1 m=1 i=1

(16)

subject to ∀i, 1 ≤ i ≤ s and ∀m, 1 ≤ m ≤ t Fim ≤ σ

(17)

366

L. Zheng et al.

In (16) Pim is the number of I-Pages of the m-th thread during the i-th time slice, Fim is the number of non-parallel threads that are generated by the m-th thread during the i-th time slice; and the deﬁnition of aij is the same as in (6). Due to the core-degree constraint, Fim must not exceed σ, which is described by (17). Similarly, using Ni deﬁned in (9), we can simplify (16) as t s Ni SR = (Fim − 1) P m=1 i=1 im

(18)

So, T can be expressed as T =

t s

(Ni · (Fim − 1)/Pim ) − ε

(19)

m=1 i=1

The upper bound of the performance overhead T caused by extra non-parallel parts when tag reduction applied to CMP can eventually be derived as ⎧ t s ⎨ T ≤ T = Nmax · ((Fim − 1)/Pim ) (20) m=1 i=1 ⎩ Fim ≤ σ We notice that T is only related to Fim and Pim , since Nmax and SP are constants for each parallel thread. When the tag reduction is not used, that is, Fim is always equal to 1, T is equal to 0, as expected. 4.3

Balance of Energy and Performance

In order to facilitate the comparison of the energy consumption and performance overhead under various core degrees for a group of applications, we introduce the normalized metrics θ and ω as follows. θnσ = Enσ /Enσ

min

and ωnσ = Tnσ /Tnσ

, θnσ ∈ (0, 1]

max

, ωnσ ∈ (0, 1]

(21)

(22)

in which θnσ is the ratio of energy consumption of the n-th application when core degree is equal to σ relative to the one when core degree is σ min , and ωnσ is the ratio of performance overhead of the n-th application when core degree is set to σ relative to the one when core degree is σ max . As we apply the tag reduction on CMP, the minimal value of core degree σ min is 2. The maximal value of core degree σ max is equal to the number of cores. Similarly, Enσ and Tnσ represent the energy consumption and performance overhead of the n-th application, respectively, when core degree is equal to σ. Based on our analysis, as core degree σ increases, θnσ goes down from 1, and ωnσ goes up to 1.

The Core Degree Based Tag Reduction on CMP

367

Theoretically, the energy consumption always decreases and the performance overhead always increases as core degree varies from 2 to the number of cores. It is meaningless that compare the energy consumption and performance overhead directly. Instead, we could compare the trends of energy consumption with the one of performance overhead of diﬀerent applications as core degree varies to ﬁnd out the balance of the energy and performance of tag reduction on CMP. So the trend of energy consumption and performance overhead, can be calculated as dθn ηn (σ) = − , σ ∈ (σ min , +∞) (23) dσ and dωn , σ ∈ (σ min , +∞) γn (σ) = (24) dσ where η and γ are the trends of energy consumption and performance overhead of the n-th application, respectively. Equation (23) makes sure that the trend of energy consumption η is positive, since energy consumption usually goes down n is negative. as core degree increases so that the value of dθ dσ So far, we can use the trends of energy consumption and performance overhead to ﬁnd out the appropriate value of core degree. In microprocessor design, it is demanded to consume less energy with low performance overhead. However, with the tag reduction on CMP, the energy consumption always goes down and the performance overhead always goes up, so that with a particular value of core degree, the bigger the value of ηn is and the less the value of γn is, the better the core degree is. Therefore, the appropriate value of core degree of each application can be determined as follows: ⎧ arg min(|γn (σ) − ηn (σ)|) : Case 1 ⎪ ⎪ σ ⎪ ⎨ arg min(γn (σ) − ηn (σ)) : Case 2 σ (25) σn = ⎪ any ∈ [σ min , σ max ] : Case 3 ⎪ ⎪ ⎩ arg min(γ (σ) − η (σ)) : else n n σ

where Case 1, 2, 3 can be expressed as follows. Case 1: ηn (σ min + Δ) > γn (σ min + Δ) and ∃σ ∈ (σ min , +∞), ηn (σ ) = γn (σ ) Case 2: ηn (σ min + Δ) < γn (σ min + Δ) and ∃m ∈ (σ min , +∞), ηn (m ) = γm (m ) Case 3:

ηn (σ) ≡ γn (σ) + C, C is a constant number

In 25, σn is the appropriate value of core degree of the n-th application. There are 3 situations when we decide the best value of core degree described in the

368

L. Zheng et al.

equation above. If we can ﬁnd a common value of σn for sorts of applications, we can know the best value of core degree that achieves the balance of the energy consumption and performance overhead of tag reduction on CMP.

5

Experiments

In our experiments, we use 22 benchmarks from SPEC CPU2006 to ﬁnd out the appropriate value of core degree, as the benchmarks from SPEC CPU2006 represent sorts of classic applications. First, we construct an experiment platform to evaluate 22 benchmarks and collect necessary data; then we calculate the energy consumption, performance overhead and their trend as described in Section 4; ﬁnally, the appropriate value of core degree can be got through the ones of 22 benchmarks. 5.1

Experiment Setup

We choose Linux 2.6.11 32bit without PAE as our experiment platform, because Linux is open source, so that it is easy for us to modify the kernel and add some modules to collect data needed. Three copies of each benchmark are run concurrently every time to create a multi-process environment. After modifying the kernel and adding extra module, we can collect the experimental data from these 3 copies of each benchmark. As analyzed above, the data needed to calculate energy and performance only concerns the physical memory information. There are no diﬀerences in physical memory management of Linux between CMPs with diﬀerent number of cores. Therefore, with the data we collected, we construct a 16-core CMP environment to calculate the energy consumption and performance overhead that are basis of computation of their balance. The experiment environment is listed as that Processor Model is Intel Core2 [email protected] and physical memory capacity is 1GB. The names of 22 benchmarks evaluated are astar, bwaves, bzip2, cactusADM, calculix, dealII, gcc, GemsFDTD, gromacs, h264ref, hmmer, lbm, libquantum, namd, perlbench, povray, sjeng, soplex, sphinx, tonto, Xalan, and zeusmp. The benchmarks consist of integer and ﬂoat point computation. Since in our experiments, we apply the tag reduction on a 16-core CMP, the range of core degree is from 2 to 16, i.e. σmin and σmax are 2 and 16, respectively. We evaluate each of 22 benchmarks with diﬀerent values of core degree from 2 to 16 with interval by 2. 5.2

Experiment Result and Analysis

Based on the analysis in Section 4, with the data collected from our experiment platform, we can calculate θnσ and ωnσ that are energy consumption ratio and performance overhead ratio of each benchmark when core degree varies from 2

The Core Degree Based Tag Reduction on CMP

369

Table 1. The Discrete Results of Energy Consumption as Core Degree Varies (θnσ ) CD astar bzip2 calculix gcc gromacs hmmer libquantum perlbench sjeng sphinx Xalan

4 0.74 0.59 0.66 0.67 0.87 0.71 0.74 0.87 0.75 0.67 0.74

6 0.54 0.40 0.57 0.56 0.75 0.60 0.59 0.76 0.58 0.55 0.63

8 0.43 0.29 0.50 0.50 0.67 0.53 0.52 0.74 0.52 0.48 0.55

10 0.38 0.29 0.45 0.51 0.60 0.50 0.44 0.72 0.46 0.45 0.51

12 0.35 0.29 0.43 0.52 0.56 0.46 0.37 0.62 0.41 0.41 0.50

14 0.33 0.30 0.43 0.46 0.49 0.42 0.34 0.62 0.38 0.37 0.44

16 0.29 0.32 0.41 0.46 0.48 0.40 0.29 0.62 0.35 0.34 0.43

CD

4 0.67 cactusADM 0.72 dealII 0.84 GemsFDTD 0.58 h264ref 0.74 lbm 0.70 namd 0.63 povray 0.73 soplex 0.73 tonto 0.83 zeusmp 0.79 bwaves

6 0.48 0.63 0.66 0.47 0.54 0.49 0.51 0.68 0.64 0.71 0.59

8 0.42 0.53 0.58 0.39 0.52 0.37 0.41 0.60 0.55 0.67 0.50

10 0.39 0.53 0.53 0.34 0.45 0.33 0.40 0.54 0.47 0.60 0.42

12 0.35 0.49 0.50 0.31 0.44 0.28 0.34 0.47 0.44 0.58 0.36

14 0.37 0.47 0.49 0.29 0.40 0.23 0.32 0.45 0.41 0.54 0.38

16 0.32 0.45 0.43 0.29 0.38 0.27 0.30 0.44 0.37 0.53 0.38

Table 2. The Discrete Results of Performance Overhead as Core Degree Varies (ωnσ ) CD astar bzip2 calculix gcc gromacs hmmer libquantum perlbench sjeng sphinx Xalan

2 0.07 0.08 0.07 0.23 0.07 0.29 0.07 0.11 0.07 0.07 0.07

4 0.20 0.25 0.20 0.66 0.20 0.88 0.20 0.34 0.20 0.20 0.20

6 0.33 0.42 0.33 0.68 0.33 0.90 0.33 0.56 0.33 0.33 0.34

8 0.47 0.57 0.47 0.50 0.47 0.92 0.47 0.79 0.47 0.47 0.48

10 0.60 0.56 0.60 0.71 0.60 0.94 0.60 0.74 0.60 0.60 0.60

12 0.73 0.68 0.73 0.10 0.73 0.96 0.73 0.52 0.73 0.73 0.72

14 0.87 0.83 0.87 0.54 0.87 0.7 0.87 0.47 0.87 0.87 0.86

CD

2 0.26 cactusADM 0.52 dealII 0.09 GemsFDTD 0.37 h264ref 0.22 lbm 0.17 namd 0.07 povray 0.07 soplex 0.04 tonto 0.07 zeusmp 0.09 bwaves

4 0.62 0.83 0.26 0.75 0.65 0.43 0.20 0.20 0.09 0.20 0.14

6 0.75 0.96 0.42 0.86 0.67 0.61 0.33 0.34 0.13 0.34 0.39

8 0.87 0.56 0.58 0.89 0.70 0.65 0.47 0.47 0.16 0.47 0.50

10 0.88 1.38 0.67 0.96 0.72 0.76 0.60 0.60 0.20 0.60 0.55

12 0.87 1.25 0.73 0.97 0.75 0.79 0.73 0.73 0.24 0.73 0.58

14 0.97 0.90 0.87 1.00 0.36 0.97 0.87 0.87 0.26 0.87 0.84

to 16 with interval by 2. Diﬀerent values of n mean diﬀerent benchmark. Because we evaluate 22 benchmarks, the n is from 1 to 22. Table 1 illustrates the energy consumption ratio of each benchmark when core degree is equal to 2, 4, till 16. CD in the table is short for core degree. The values of ratio are distributed discretely by the discrete values of core degree, because the original experimental results are discrete. With this table, we can ﬁnd out that the energy of consumption of all the benchmarks decreases when the value of core degree increases. Because the value of energy consumption ratio of all benchmarks is 1 when core degree is 2 as the deﬁnition of normalization in (21), we omit these values in the table. Table 1 and the theoretical analysis in Section 4.1 ﬁt neatly. Similarly in Table 2, it shows the performance overhead ratio of each benchmark. With the deﬁnition of normalization in (22), the performance overhead ratio is 1, when core degree is 16. However, we notice that the maximum of performance overhead ratio of some benchmarks exceeds 1. The performance overhead ratio for most of 22 benchmarks are monotonic increment along the

370

L. Zheng et al.

(a) The results of derivative of energy con- (b) The results of derivative of perforsumption as core degree varies. mance overhead as core degree varies. Fig. 1. The discrete results of derivative of energy consumption and performance overhead ratio

varying of core degree from 2 to 16. However, some benchmarks behave more complicatedly than monotonically. The results of performance overhead ratio from experiments are a bit more complex than the theory in Section 4.2. With the results above, we should follow the analysis in Section 4.3 to get the trends of energy consumption and performance overhead of each benchmark, ηn and γn , respectively. This is basis of determining the appropriate value of core degree. In order to get ηn and γn , we ﬁt the energy consumption ratio function and performance overhead ratio function with the their discrete results with help of Matlab. With the functions ﬁtted, we can get the expressions of derivative of these functions, so that the trends are available for calculation. In Fig. 1, we index each benchmarks from 1 to 22 alphabetically, instead of using its name for convenience. Fig. 1 shows the results of derivative of energy consumption and performance overhead ratio, respectively, when core degree is set to 2, 4 till 16. The x-axis is the value of σ, the y-axis is the index of benchmarks. The z-axis in Fig. 1(a) is the results of derivative of energy consumption, the one in Fig. 1(b) is the results of derivative of performance overhead. In Fig. 1(a), the results of derivative of energy consumption trends to 0 along the σ varies, which implies that the eﬀect of energy saving by tag reduction on CMP is weaker and weaker when core degree varies incrementally. We ﬁnd out that the results of derivative of energy consumption vary similarly among all benchmarks. When core degree is greater than 12, part of benchmarks achieve their low peaks, and others then still remain monotonic, which means after Core Degree is 12, part of benchmarks can save more energy, the others are otherwise. The results of derivative of the performance overhead are shown in Fig. 1(b). We notice that the trends of most of benchmarks remain stable; the ones of others are otherwise. Besides, the negative values appear several times in this ﬁgure, which means the performance overhead decreases when core degree is equal to some particular values. It is obvious that all benchmarks can be divided into two parts: Part A and Part B. The derivative of performance overhead ratio

The Core Degree Based Tag Reduction on CMP

(a)

371

(b)

Fig. 2. The appropriate value of core degree

of benchmarks in Part A stays almost the same; on the contrary the one in Part B changes dramatically as the value of core degree varies. With the results of derivative of the ﬁtted energy consumption and performance overhead ratio of each benchmark, we can get the appropriate value of core degree of each benchmark, i.e. σn , according to (25) in Section 4.3. The distribution of σn is illustrated in Fig. 2(a). Each point in this ﬁgure shows a particular benchmark’s appropriate value of core degree. The x-value of a point is the index of a benchmark, and the y-value of the point is the appropriate value of core degree of this benchmark. There are two districts in this ﬁgure. One is the ﬁlled with light grey and the other is with dark grey. The light grey district is a district around core degree 6 with radius 2; and the dark grey district is a district around core degree 6 with radius 1. The appropriate values of core degree of 22 benchmarks are distributed between 3 and 10. Most of benchmarks are in light grey district, especially in dark grey district. Fig. 2(b) oﬀers the distribution with the number of benchmarks more clearly. In this ﬁgure, the x-axis is the value of core degree, the y-axis is the number of benchmarks. Taking the cylinder at core degree 6 for example, this cylinder represents that the appropriate values of 13 benchmarks is in the interval of (5,7]. Therefore, we can see that all appropriate values of core degree of benchmarks are from 2 to 10, with a peak at core degree 6 that is much higher than ones at other core degree. As the benchmarks from SPEC CPU2006 represent sorts of typical applications, from the results in Fig. 2, we can ﬁnd out when core degree is 6, it is appropriate for applications to balance energy consumption and performance overhead.

6

Conclusion

In this paper, we ﬁrst review the fundamentals of tag reduction on CMP that can save energy consumption. Then we introduce the core degree concept, meanwhile propose the core degree based approach to tag reduction to balance the energy consumption and performance overhead. Furthermore the analysis of balance of energy and performance leads to ﬁnding out an appropriate core degree. In the experiment, we apply our core degree based tag reduction to the 16-core CMP,

372

L. Zheng et al.

and then use 22 benchmarks of SPEC CPU2006 to evaluate our approach, ﬁnally ﬁnd out the optimal value of core degree. The experimental results show that when the core degree is 6, we can get the most desired balance of energy saving and performance overhead.

Acknowledgements This work is supported by National 973 Basic Research Program of China under grant No. 2007CB310900, the National High-Tech Research and Development Plan of China (863 Plan) under Grant Nos. 2008AA01Z106, and Fellowships of the Japan Society for the Promotion of Science for Young Scientists Program, Excellent Young Researcher Overseas Visit Program.

References 1. Spracklen, L., Abraham, S.: Chip multithreading: Opportunities and challenges. In: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pp. 248–252. IEEE Computer Society, Washington (2005) 2. Held, J., Bautista, J., Koehl, S.: From a few cores to many: A tera-scale computing research overview. Research at Intel white paper (2006) 3. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA Tesla: A uniﬁed graphics and computing architecture. IEEE Micro, 39–55 (2008) 4. Monchiero, M., Palermo, G., Silvano, C., Villa, O.: Eﬃcient synchronization for embedded on-chip multiprocessors, vol. 14(10), pp. 1049–1062 (2006) 5. Edmondson, J., Rubinfeld, P., Bannon, P., Benschneider, B., Bernstein, D., Castelino, R., Cooper, E., Dever, D., Donchin, D., Fischer, T., et al.: Internal organization of the Alpha 21164, a 300-MHz 64-bit quad-issue CMOS RISC microprocessor. Digital Technical Journal 7(1), 119–135 (1995) 6. Montanaro, J., Witek, R., Anne, K., Black, A., Cooper, E., Dobberpuhl, D., Donahue, P., Eno, J., Hoeppner, W., Kruckemyer, D., et al.: A 160-mhz, 32-b, 0.5-w CMOS RISC microprocessor. IEEE Journal of Solid-State Circuits 31(11), 1703– 1714 (1996) 7. Petrov, P., Orailoglu, A.: Dynamic tag reduction for low-power caches in embedded systems with virtual memory. International Journal of Parallel Programming 35(2), 157–177 (2007) 8. Petrov, P., Orailoglu, A.: Virtual page tag reduction for low-power TLBs. In: Proceedings of 21st International Conference on Computer Design, pp. 371–374 (2003) 9. Zhou, X., Petrov, P.: Heterogeneously Tagged Caches for Low-Power Embedded Systems with Virtual Memory Support. ACM transactions on Design Automation of Electronic Systems 13(2), 32 (2008) 10. Petrov, P., Orailoglu, A.: Tag compression for low power in dynamically customizable embedded processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 23(7), 1031–1047 (2004) 11. Zheng, L., Dong, M., Guo, S., Guo, M., Li, L.: I-Cache Tag Reduction for Low Power Chip Multiprocessor. In: 2009 IEEE International Symposium on Parallel and Distributed Processing with Applications, Chendu, Jiuzhai Valley, China, August 10-12, pp. 196–202 (2009)

Improve Throughput of Storage Cluster Interconnected with a TCP/IP Network Using Intelligent Server Grouping Xuechen Zhang1 , Guiquan Liu2 , and Song Jiang1 1 Wayne State University, University of Science and Technology of China [email protected], [email protected], [email protected] 2

Abstract. Cluster-based storage systems connected with TCP/IP networks are expected to achieve a high throughput by striping ﬁles across multiple storage servers. However, for the storage system interconnected with the TCP/IP network, several critical issues, like Incast eﬀect and data access interference, invalidate the assumption that higher access parallelism always results in increased I/O throughput. To address this issue, we propose a new ﬁle striping strategy, named as storage server grouping (SSG), which changes ﬁle striping pattern across the storage servers based on the analysis of ﬁle popularity and impact of the number of storage servers on the clients’ perceived performance (I/O speedup) to reduce the interference of data accesses to popular ﬁles and avoid dramatic reduction of system throughput caused by the Incast eﬀect. Our experimental evaluation shows that SSG can improve I/O throughput by 22.1% on average. Keywords: The Incast Eﬀect, Interference, File Striping, Lustre.

1

Introduction

Cluster-based storage systems are adopted as an increasingly important platform for supporting applications that demand high I/O performance for large-scale data access [2]. The building blocks of many such systems are a number of small and less capable storage servers which are usually connected with commodity low-cost and high-performance Gigabit TCP/IP-Ethernet networks [1]. Figure 1 illustrates a typical cluster-based storage system. The TCP/IP network usually has suﬃciently high link bandwidth (more than 1Gbps) but the switch has very limited size of buﬀer to store and transfer data. In the storage system managed by parallel ﬁle systems, such as Lustre [3], GPFS [4], and PVFS [5], the data accessed by one request from a client can be distributed on multiple storage servers. Accordingly, a client actually disassembles such a request into several small sub-requests, each to a storage server for a piece of data stored on the server. The servers then concurrently access the pieces of data for each sub-request. In this way, ﬁle striping helps achieving high data access parallelism, which is expected to improve I/O throughput. Parallel C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 373–389, 2010. c IFIP International Federation for Information Processing 2010

374

X. Zhang, G. Liu, and S. Jiang StorageServers Clients

TCP/IPͲEthernet network

ServerDisks Strip

Strip

switch

Strip

Strip

Fig. 1. A typical cluster-based storage system is composed of clients, TCP/IP-Ethernet network including a switch, and a number of storage servers managed by a parallel ﬁle system. In the system, a ﬁle is divided into several strips, each stored on a storage server according to a set of pre-deﬁned conﬁguration parameters.

ﬁle systems usually hide the details on ﬁle placement and access protocol. However, many of them, such as Lustre, provide interfaces allowing the owner of ﬁles to set some critical data striping parameters, such as striping unit size, striping factor (the number of storage servers to store ﬁles), and striping index (the ﬁrst storage server for the striping). We conducted experiments to study performance of such systems. Our key observations are 1) high disk interference on storage servers can be incurred by clients’ excessive exploitation of their data access parallelism without coordination; 2) the limited switch buﬀer size might cause TCP throughput to collapse; and 3) data access to popular ﬁles intensiﬁes the disk head contention among clients on I/O servers. These observations make the conventional approaches less eﬀective, which focus on striping data over many servers and balancing I/O load over the servers [6] in order to explore the parallelism of data accesses for a high I/O throughput. To address the issue, we propose SSG (Storage Server Grouping) as a framework to automatically generate ﬁle striping parameters in an on-line manner. SSG uses the proposed I/O speedup model to ﬁnd the optimal number of storage servers before a ﬁle is striped across storage servers. The I/O speedup model is trained by using relative machine learning technique [11] to correlate the number of storage servers with I/O performance of a workload. SSG keeps tracking ﬁle popularity and intelligently separates ﬁles into diﬀerent server groups by setting the striping index, reducing data access interference on each group. SSG also periodically tunes the ﬁle striping parameters based on the I/O workload characteristics proﬁled on line. We have implemented the SSG scheme on top of Lustre parallel ﬁle system. Our experimental results show that SSG can improve system-wide I/O throughput by up to 38.6% and 22.1% on average. The remainder of this paper is organized as follows. Section 2 presents our experimental observations. Section 3 discusses related work. Section 4 presents the design of the SSG scheme in detail. Section 5 delves into experimental evaluation, and section 6 concludes the paper.

Storage Cluster Interconnected with a TCP/IP Network

2

375

Experimental Observations

To study the performance of TCP/IP network based storage cluster, we set up an experimental platform consisting of sixteen nodes, eight conﬁgured as compute nodes, and the other eight as I/O nodes, managed by the Lustre parallel ﬁle system. File data was striped over the I/O nodes. We used the default Lustre ﬁle striping setting, whose ﬁle striping size is 64KB, striping factor is 8, and striping index is 0. (More details of the experimental platform are described in Section 5.) We used the Iozone benchmark [7] for three types of reads: the sequential read, the random read, and the backward read in the throughput mode, which we can specify the number of active threads in its running. 2.1

The Incast Eﬀect

Increasing the number of I/O servers of a storage cluster connected with TCP/IP networks does not necessarily improve the eﬀective network bandwidth. As we know, the Ethernet switch has only very limited built-in memory. For example, the widely used HP Procurve 2848 Ethernet switch has only 16MB memory. When the incoming packets arrive too fast to be buﬀered in the cache, some of them would have to be discarded due to the limited buﬀer size, which will cause TCP timeout and re-transmission of the packets from the servers. In this scenario, the servers cannot serve the next I/O request from clients until the timeout is detected and all the discarded packets are re-transmitted. The worst scenario, in which network delay is seriously increased, network links are idle, and system throughput is collapsed, is called the Incast eﬀect [1,9]. To observe the impact of Incast eﬀect on the I/O throughput, we run the Iozone benchmark for the three types of reads: sequential read, random read, and backward read on a client with 512KB request size. An Iozone thread on the client accesses a ﬁle which is striped over diﬀerent numbers of I/O servers, ranging from 1 to 8. Figure 2 shows that the I/O throughput does not increase

160 sequential read random read backward read 140

I/O Throughput (MB/s)

120

100

80

60

40

20

0 1

2

3

4

5

6

7

8

# of I/O Servers

Fig. 2. System-wide I/O throughput observed at the application level when running the Iozone benchmark on a single client as the number of I/O servers increases. Request size of these tests is 512KB. Incast eﬀect happens when the number of I/O servers is 8, making the system throughput 15% less than the maximum.

376

X. Zhang, G. Liu, and S. Jiang

with the increase of the number of I/O servers. The throughput even becomes lower by 15% when the number of I/O servers increases from 6 to 8, which is a clear indication of the Incast eﬀect. However, the Incast eﬀect does not necessarily happen for all workloads of any types of access patterns. For example, when average request size of a workload is less than striping unit size, the Incast eﬀect will be less likely to occur since the chance of switch buﬀer saturation caused by the requested data from multiple servers in response to a single request is very low. However, we cannot simply always set the striping unit size larger than the average request size, because this would essentially eliminate data access parallelism. A better way to alleviating the Incast eﬀect and improving the networking eﬃciency is to carefully select the number of storage servers for ﬁle striping. However, it would take too much time for administrators to ﬁnd the number simply by human experience. SSG uses a proposed I/O speedup model to ﬁnd the optimal number of storage servers before a ﬁle is striped across storage servers to avoid the Incast eﬀect. 2.2

Interference among Requests from Diﬀerent Clients

Usually when a storage server is dedicated to serve requests from one client, the eﬃciency of its disks can be well maintained. Compared with increasingly large disk capacity used for storing data for many clients, the amount of data accessed by one client is limited, or the disk region covered by requests from one client is relatively small. This allows the disk heads to move in relatively short distances in continuously serving requests from the same client and to keep relatively high disk throughput. However, serving requests from multiple clients concurrently requires disk heads to move in longer distances among disk regions storing data belonging to diﬀerent clients. The expensive mechanical operation would dramatically reduce both the I/O throughput received by each individual client and the throughput of the entire disk-based server cluster. This inter-client interference becomes more intensive when data are striping over more storage servers and thus each server serves requests from more clients. The worst scenario is to have a long disk-head movement across client-data regions in serving every request. To observe impact of the interference on the I/O throughput, we run the Iozone benchmark on diﬀerent numbers of clients, ranging from 1 to 16, while keeping the number of servers constant, so that each I/O server will serve increasingly large number of concurrent requests. We test two scenarios. In the scenario for light interference, the Iozone thread on each client accesses one set of ﬁles, which are distributed on close disk regions. In the scenario for heavy interference, the thread on each client accesses a diﬀerent set of data ﬁles on the disks. In both scenarios, sequential access pattern is used. Figure 3 shows the system-wide I/O throughputs in these two scenarios. The results clearly show that the interference has a major impact on the I/O eﬃciency. For example, with 16 clients, the throughput with the heavy-interference scenario is only 74% of that with a light interference.

Storage Cluster Interconnected with a TCP/IP Network

377

350 Heavy interference Light interference

I/O Throughput (MB/s)

300

250

200

150

100

50 0

2

4

6

8 # of Clients

10

12

14

16

Fig. 3. System-wide I/O throughput observed at the application level when running Iozone benchmark on multiple clients while keeping the number of I/O servers constant. Data access to diﬀerent set of disk regions (heavy interference) results in much more overhead when the number of clients is larger than 6.

The interference can be alleviated by serving many requests from the same client in a batch before moving disk heads to serve requests from other clients. In this way, the long-distance movements of disk heads can be reduced and disk throughput can be improved. This technique is especially eﬀective with asynchronous I/O, in which a number of requests from the same process are pending and ready to be scheduled together. However, for the synchronous I/O, which is commonly used in most applications, a technique named as anticipatory scheduling [10] is used to serve multiple requests from the same process together. In the scheduling policy, the disk will wait for the next request from the same process to serve after it served its current request, even though there are pending requests from other processes. For the policy to be eﬀective, the time gap between two consecutive requests from the same process, or thinktime of the process, must be small so that the gain from the reduced disk head movement is larger than the loss from the disk idling. However, the thinktime can be signiﬁcantly increased if a ﬁle is striped over many I/O servers because a request is considered to be completed only when its requested data on all the servers are accessed. As the disk operations on diﬀerent servers are not synchronized, the thinktime can be signiﬁcantly increased with the increase of the parallelism, or the number of servers involved, in the serving of a request. While the parallelism can be beneﬁcial to I/O throughput when the interference is not serious, it can hurt the throughput when the interference is intensive. Our SSG scheme addresses the issue by making a trade-oﬀ between potentially high throughout due to access parallelism and excessive interference due to high parallelism, looking for the optimal number of servers for the ﬁle striping. 2.3

Intensiﬁed Data Access Contention

Data access contention is intensiﬁed when there are multiple popular ﬁles in the same storage system and all the ﬁles are simultaneously accessed by diﬀerent clients. The default consistently aggressive striping for data access parallelism

378

X. Zhang, G. Liu, and S. Jiang

over the same set of servers can lead to a high probability in which one server simultaneously services requests for actively used ﬁles from many clients. This increased intensity of competition for disk services causes high inter-client interference. System throughput degradation caused by data accesses to popular ﬁles can be alleviated through separating the popular ﬁles into diﬀerent storage server groups using the SSG scheme proposed in this paper. One of the SSG components is responsible for tracking the popularity of each ﬁle and, accordingly, placing popular ﬁles into diﬀerent groups by carefully selecting ﬁle striping parameters. Therefore, data access contention to popular ﬁles can be reduced in each group. All these observations indicate that optimizing ﬁle striping parameters on line is critical to avoiding the Incast eﬀect and reducing the interference overhead caused by aggressive striping and data access contention to popular ﬁles. SSG can automatically tune these parameters on line to achieve better I/O performance than the traditional default striping method.

3

Related Work

There are a large of body of work about modeling I/O storage systems, dataaware storage resource management, and solutions to the Incast eﬀect. 3.1

Research on Storage System Simulation

Some existing disk simulators, like Disksim [15,14], can help to accurately predict I/O workload performance from single disks or disk arrays by using software to simulate device behaviors. But developing such kind of simulators requires extensive expertise and knowledge about details of hard disks and disk arrays. Furthermore, it is even harder to have a simulator for the cluster-based storage system because many dynamics on the networks are involved. Analytic models which are presented in papers [16,17,18] describe device behavior with a set of formulae. They are computationally eﬃcient but have the same drawbacks as the simulation methods. Machine learning techniques can also be used to model a storage system. Wang et al. proposed a scheme to predict storage device performance by using CART model [12]. They treat an I/O device as a function whose input is workload characteristics, either at the request level or at the workload level, and whose output is the request response time. This CART model does not require a priori knowledge on storage devices and is able to predict I/O performance of single disks and disk arrays with less than 20% median relative error on average, as reported in the paper. Mesnier et al. proposed a relative ﬁtness model [11], which captures the diﬀerences between a pair of devices instead of predicting the absolute performance of workloads. Their method can reduce prediction error by as much as a factor of two when compared to absolute models. Therefore, we choose to build our I/O speedup relative ﬁtness model.

Storage Cluster Interconnected with a TCP/IP Network

3.2

379

Research on Data-Aware Storage Resource Management

Yu et al. proposed a hierarchical striping method [20] to reshape the data access pattern to storage servers, resulting in an improved aggregated bandwidth. But they did not analyze I/O speedup of a workload for ﬁle striping or consider the data access interferences to popular ﬁles. Kosar proposed a job scheduling approach [19] to explore data locality of distributed storage resources on-line by analyzing the processing of applications. However, the job-level approach does not have enough knowledge on dynamic changes of workload characteristics or current network statuses to reduce the Incast eﬀect. 3.3

Research on Solutions to the Incast Eﬀect

The Incast eﬀect can be alleviated by increasing the switch buﬀer size or tuning TCP ﬂow control parameters such as package loss timeouts [1]. However, solutions at the network level are less adaptive to the changes of workload behaviors and are not cost-eﬀective as switches with large buﬀers can be expensive. The problem can also be addressed at the application level [1,24]. Several possible application-level methods are proposed in [21], including increasing request size, global scheduling of data transfers, and limiting the number of synchronously communicating servers.

4 4.1

Storage Server Grouping Scheme I/O Speedup

We introduce the concept of I/O speedup for each client’s workload to characterize the reduced workload running time due to the use of increased number of storage servers. The I/O speedup for k severs is deﬁned as the ratio between the workload running time when its data is striped on the k storage servers and the running time when one dedicated server is used. We use a dedicated server as the baseline for the comparison to make sure that no dynamically varying interference is involved in the metric. The dedicated server is named as a reference server. If the I/O speedup for k servers is greater than 1, then a workload can beneﬁt from striping its ﬁle across the k servers. Since the I/O speedup not only depends on the characteristics of a workload but also depends on the conﬁguration of networks such as the size of the buﬀer in the network switch, it would be a time-consuming task for administrators to ﬁnd the number of servers to obtain the highest speedup. The I/O speedup of a workload is automatically determined in our SSG scheme, using relative ﬁtness modeling approach [11] rather than traditional direct modeling approach since the relative ﬁtness modeling provides higher prediction accuracy [11]. The I/O speedup relative ﬁtness model is used to calculate the I/O speedup for k servers and expressed as the following function, T imeref T imek

= RFref →k (W Cref , P erfref , U tilref ),

380

X. Zhang, G. Liu, and S. Jiang

where W Cref represents the workload characteristics, such as request size and request arrival time, which are proﬁled when the workloads run on the reference server; P erfref is a performance metric such as latency; and U tilref is a vector of values, such as the devices’ cache utilization and cache hit ratio. After we have built up the model, or speciﬁcally, have obtained function RFref →k , we only need to measure the performance characteristics, such as W Cref , P erfref , and U tilref , on the reference server for a newly admitted workload and then use the models to derive the maximal I/O speedup of the workload. We instrument the MPI-IO library for parallel I/O benchmarks or source code of applications for ﬁlesystem benchmarks to collect the required measurements. Details on the model training are described on Section 3.1.1. If the total number of storage servers is N, SSG needs to generate N models, each on one possible striping factor k (k = 1, 2, ..., N ). Figure 4 shows two steps involved in the prediction of the I/O speedup. The ﬁrst step is ﬁtness oﬀ-line tests for building I/O speedup models. The models are trained before a cluster-based storage system starts to serve requests from clients using ﬁtness tests and will be periodically updated using the actually measured I/O speedup values. The second step is I/O speedup on-line prediction during workload admission control. More speciﬁcally, SSG runs an application to obtain the characteristics, performance, and resource utilization of a workload on the reference server before actually striping the data set of the workload over the storage servers. Based on these proﬁling data and I/O speedup models, SSG predicts I/O speedup of the workload on diﬀerent number of servers. After enumerating all I/O speedup relative ﬁtness models, SSG can determine the optimal number of servers for ﬁle striping. Training an I/O Speedup Model. To train an I/O speedup model, SSG uses a synthetic workload generator to generate training requests, which should exhibit adequate coverage of storage system characteristics. The training data is obtained when proﬁling the training workloads running on the reference server and candidate server groups. Each training sample includes the following variables [12]: Requesti = {T imeDif fi(1), ..., T imeDif fk (k), LBNi , LBN Dif fi (1), ..., LBN Dif fi (l), Sizei , RWi , Seqi , Cachei , Latencyi } Where T imeDif fi(k) = ArrivalT imei −ArrivalT imei−2k−1 and LBN Dif fi (l) = LBNi − LBNi−l . The ﬁrst two groups of parameters capture temporal and spatial locality of a workload. Seqi indicates whether a request is a sequential access. Cachei indicates whether a request hits in memory buﬀer. Sizei represents size of a request and RWi represents read or write attribute of a request. Latencyi is the service time of a request. The two parameters, k and l, determine how far we look back for request bursts and locality [12]. The value of k and l is pre-deﬁned before ﬁtness tests.

Storage Cluster Interconnected with a TCP/IP Network Areference server Modellearning algorithm

Acandidateserver groupwithK servers

381

I/Ospeedupmodelof / p p thecandidateserver groupwithKservers againstthereference server

Step I: fitness test Workload characteristics on reference server Application

Thereference server

Performance of reference server Resource utilization of reference server

I/Ospeedup p p modelofthe candidateserver groupwithK servers against serversagainst thereference server

I/O Speedup if I/OSpeedupif theapplication runningonthe candidateserver group instead of groupinsteadof onareference server

St II Step II: I/O speedup d prediction di ti

Fig. 4. There are two steps for the prediction of I/O speedup. In step I, SSG uses ﬁtness tests to construct I/O speedup relative ﬁtness model for a candidate server group with speciﬁc number of servers. In step II. SSG predicts the I/O speedup for running time of an application on the reference server over its running time on a candidate server group of k servers.

SSG uses REPTree [13] to train the I/O speedup relative ﬁtness model. For each Latencyref training request in the workloads, the predicted variable is Latencytarget , where Latencyref is the request service time on a reference server and Latencytarget is the request service time on a candidate server group, and the predictor variables are a variable vector {T imeDif fref , ..., LBNref ,..., Sizeref , RWref , Seqref , Cacheref }. 4.2

Harmoniousness Analysis Based on File Popularity

Since there will be several server groups that have the same striping factor, SSG has to decide in which group a workload should join in by setting a striping index for a ﬁle. This process plays a critical role in reducing the contention due to serving many popular ﬁles concurrently on the same group of servers. Accordingly, a ﬁle accessed by a workload prefers to join in a group with fewer number of popular ﬁles. We deﬁne ﬁle popularity as the access frequency of the ﬁle during each time slot Tslot and deﬁne a harmoniousness value H of a server group as the sum of popularity of 20% most popular ﬁles in the group. A fresh ﬁle is conservatively assigned a maximal access frequency F eqmax even though no requests have accessed it. 4.3

Putting It All Together

The whole procedure is an iterative process, including feedback control in looking for the best striping parameters for a ﬁle. Figure 5 shows the framework of the SSG scheme. During the workload admission control, SSG scheme ﬁrst ﬁnds the number of I/O servers, where the workloads can achieve highest I/O speedup. Then SSG chooses the group that has the smallest harmoniousness value for workload ﬁles to stripe over by setting striping index as the index of the ﬁrst server in the group. File popularity tracking component is responsible

382

X. Zhang, G. Liu, and S. Jiang

for periodically updating the ﬁle popularity by communicating with each tracking thread running on clients. The striping unit size is assigned as the average request size of a workload divided by the number of I/O servers in its server group.

FilePopularityTracking

FitnessTests

OfflineProfilingon ReferenceSystem

Harmoniousness Analysis

I/OSpeedupModelof StorageSystem

RequestLevelReference SystemDescription

StripingIndex

StripingFactor

AverageRequestSize

StripingUnitSize

Fig. 5. The SSG scheme. In the workload admission control, SSG initializes striping parameters for ﬁles to be accessed. The ﬁle popularity tracking keeps updating popularity of each ﬁle periodically until the ﬁle is deleted.

The overhead of the SSG scheme includes the time for proﬁling applications on the reference server and the time for tracking ﬁle popularity. With limited loss of the prediction accuracy, SSG can approximate the characteristics of a workload by sampling I/O requests in a relatively coarse granularity during its running period. This proﬁling time can also be overlapped with the job scheduling time in high performance computing environment or request queueing time in other client-server environment. In addition, since SSG uses thread communication for ﬁle popularity tracking, the frequency for communication can be reduced for reduced overhead if data access has good locality.

5

Performance Evaluation and Analysis

We have implemented a prototype of the proposed SSG scheme over the Lustre parallel ﬁle system [3]. In this section, we will describe the experiment environment, evaluate prediction accuracy of I/O speedup model, and show the performance of the SSG scheme in diﬀerent scenarios compared with static aggressive ﬁle striping policies to demonstrate that how SSG would respond to the Incast eﬀect and variations of ﬁle popularity. We use (a, b, c) to represent the ﬁle striping triplets, where a is the striping unit size, b is the striping index, and c is the striping factor. 5.1

Experimental Setup

Our experiments are conducted on a dedicated cluster-based storage system with 16 nodes. All nodes are of identical conﬁguration, each with a dual 1.6GHz

Storage Cluster Interconnected with a TCP/IP Network

383

Pentium processor, 1GB memory, and an 80GB SATA hard disk. The cluster uses the Lustre parallel ﬁle system (version 1.6.5.1), in which eight nodes were conﬁgured as storage servers and one of them also as a metadata server. The default striping triplet in our Lustre environment is (64KB, 0, 8). We also chose one server with the same conﬁguration as the reference server. Each node runs Redhat Enterprise Linux (version 4.5) with kernel-2.6.9. All nodes are connected through a switched Gigabit Ethernet. The switch is D-Link DGS-1016D with 16 ports and 340Kbytes built-in buﬀer memory [22]. 5.2

I/O Workloads

We choose two kinds of I/O workloads, synthetic workloads and real workloads, in the experiments. Synthetic workloads are produced by a workload generator and used to build I/O speedup models. For this purpose we ensure that there is an adequate coverage of characteristics of workloads that are possibly presented to the storage system, including data access locality, the number of burst requests, and workload size. To evaluate the performance, we use I/O workloads from proﬁling of real benchmark running, such as mpi-io-test from PVFS2 software package, noncontig from the Parallel I/O Benchmarking Consortium at Argonne National Laboratory [23], and Iozone ﬁlesystem benchmark. The ﬁrst two benchmarks are used to test I/O system performance through parallel I/O interfaces, while the last one generates POSIX-compatible Linux ﬁle operations to evaluate ﬁlesystem performance. In the mpi-io-test benchmark , each process accesses (reads or writes) one segment of contiguous data at a time and all the parallel processes access several segments in a row. Benchmark noncontig uses complex MPI derived data structure vector to access a ﬁle which is considered to be a two-dimensional array by each process. There are several columns in the array and each process accesses a column of the array, starting at row 0 of its designated column. In one row of a column there are elmtcount elements of the M P I IN T type. So the width of a column is elmtcount*sizeof(M P I IN T ). In each call, the total amount of data accessed by the processes is ﬁxed, determined by the buﬀer size, which is 16MB in our experiment. That is, the larger elmtcount, the more small pieces of data, which are non-contiguous to each other, that are accessed by each process. Each process only reads or writes one ﬁle in its running. For the Iozone benchmark [7], we calculate system-wide I/O throughputs of sequential read, random read, and backward read workloads. 5.3

Prediction Accuracy of I/O Speedup Models

In this section, we will evaluate the prediction accuracy of the I/O speedup models generated through the ﬁtness tests2 by using a metric, relative error x%, 2

In ﬁtness tests, we use 80% of synthetic workloads as training data set for discovering the I/O speedup relations, and the other 20% as test data set to evaluate prediction accuracy of this REPTree learning approach.

384

X. Zhang, G. Liu, and S. Jiang

which is deﬁned as |Tm − Tp | ∗ 100/Tm = x%, where Tm is the latency measurement (service time) of a request and Tp is predicted latency of the request. The smaller the relative error is, the better the predicted results are. We ran the noncontig benchmark on the reference server twice with one process and two processes respectively to get proﬁling data. Then we use the data and I/O speedup models to predict the I/O speedup of each request on N candidate server groups with diﬀerent number of storage servers. Figure 6 shows the CDF curves of relative error of all the predicted latency of requests. Each curve has a long tail, indicating that while the majority of predictions are quite close to measured latency, there is a small percentage of relative errors that are greater than 50%.

Fig. 6. CDF of the prediction error of predicted latencies when the noncontig benchmark runs with single process (left ﬁgure)or two concurrent processes (right ﬁgure). Over 90% of predictions are accurate within 16.7% error on average.

5.4

POSIX-I/O Experiments: Iozone Benchmark

We repeat the experiments described in Section 2.2 with the SSG scheme enabled. Iozone threads running with diﬀerent numbers of clients sequentially read diﬀerent ﬁles. Since all the ﬁles accessed by the clients have the same popularity, SSG isolates the accesses to each ﬁle, which is striped over a separate server group so that interference of data access from diﬀerent threads is reduced as much as possible. Figure 7 shows the results of these tests. When the overhead due to disk-head movements becomes a dominating factor to system performance, SSG scheme helps to increase I/O throughput by up to 38.6% in the experiments, compared with the default aggressive setting. For the random and backward read, SSG achieves roughly the same improved performance. From this ﬁgure we also ﬁnd that when the number of clients is fewer than 4, the I/O throughput with SSG scheme is slightly smaller than the default setting. This is because client I/O interference on disks can be alleviated through I/O scheduling like anticipatory scheduling when interference of data accesses is not intensive. 5.5

Scenario I: Parallel I/O Experiments

This experiment is to demonstrate the eﬀectiveness of the SSG scheme to alleviate the Incast eﬀect and to show how server grouping can lead to better I/O

Storage Cluster Interconnected with a TCP/IP Network

385

400 aggressive striping SSG 350

I/O Throughtput (MB/s)

300

250

200

150

100

50 0

2

4

6

8

10

12

14

16

# of Clients

Fig. 7. The I/O throughput increases by up to 38.6% using the SSG scheme compared with default aggressive striping, while the number of clients increases

throughput. We execute four processes, each running the mpi-io-test benchmark with four diﬀerent segment size, 64KB, 128KB, 256KB, and 512KB, respectively, to access diﬀerent ﬁles from f ile0 to f ile3 striped over storage servers. Processes start executing the application at the same time. In the ﬁrst test, we bypass the SSG scheme and use default ﬁle striping (64KB, 0, 8) for the four ﬁles. In the second test, we launch the SSG threads and use ﬁle striping triplets suggested by the I/O speedup model. The striping unit size is rounded to the nearest multiple of 64KB following the Lustre system requirement. Figure 8 shows the predicted I/O speedup of candidate server groups with the speciﬁc number of I/O servers for the mpi-io-test benchmark. Based on the prediction results, SSG chooses 3 as ﬁle striping factor for f ile0 and f ile3 , and 4 for the others. In next step, we assume that all the ﬁles have the same popularity in the beginning. SSG selects the ﬁle striping factor in the round-robin fashion. Finally, striping parameters for each ﬁle is set as follows: f ile0(64KB, 0, 3), f ile1(64KB,

3.5 64KB segment size 128KB segment size 256KB segment size 512KB segment size

I/O Speedup

3

2.5

2

1.5

1 1

2

3

4

5

6

7

8

# of I/O Nodes

Fig. 8. The predicted I/O speedup of the candidate server groups for the mpi-io-test benchmark

386

X. Zhang, G. Liu, and S. Jiang

I/OThroughput(MB/s)

80 70 60 50 40 30 20 10 0 64KB

128KB

256KB

512KB

Segmentsizeforprocesses aggressivestriping

SSG

Fig. 9. Experiment results in scenario I. There are four processes involved in this test. Each process reads 1GB data with 64KB, 128KB, 256KB, and 512KB segment size, respectively.

3, 4), f ile2 (64KB, 7, 4), and f ile3(128KB, 3, 3). Figure 9 shows the throughput observed by the benchmarks for each process with diﬀerent segment sizes. The experiments show that the throughput is increased by 32.1% on average as SSG reduces the interferences on disks of each storage server makes the entire system relieved from the performance bottleneck due to the Incast eﬀect. 5.6

Scenario II: Parallel I/O Experiments

This experiment shows how the SSG scheme reduces the overhead incurred by the contention on concurrent accesses of popular ﬁles. In this experiment, we also use the mpi-io-test benchmark, which is modiﬁed to support reading in diﬀerent rates to simulate ﬁle access frequency, by controlling how many requests are produced in one second. There is a process running in background from the beginning, reading f ile0 with 64KB segment size and default striping setting. At the beginning, two processes read f ile1 and f ile2 in the rates of 500req/s and 1req/s with segment size 128KB and 256KB, respectively. After 3 seconds, a newly launched process will write a new ﬁle f ile3 with segment size 512KB. In this scenario, if without the SSG scheme, system will use default striping pattern (64KB, 0, 8) for all the ﬁles. In contrast, for SSG, the striping factors are determined by the trained I/O speedup models based on its prediction results shown in ﬁgure 8. SSG chooses (64KB, 0, 4) for f ile1 and (64KB, 4, 4) for f ile2 . When f ile3 is examined in admission control, the SSG scheme ﬁnds that f ile1 is much more popular than f ile2 based on the results from ﬁle popularity tracking. As a result, in order to reduce the interference between the new ﬁle and f ile1 , f ile3 is striped across server4 to server6 with 128KB striping unit size. Figure 10 shows the throughput of each process running the mpi-io-test benchmark in this experiment. The average throughput improvement is 12.2%. From this ﬁgure, we ﬁnd that the process running in background beneﬁts little from the server grouping scheme, because f ile0 is striped over all the servers and without ﬁle redistribution SSG cannot help to reduce the interference of data accesses to this ﬁle.

Storage Cluster Interconnected with a TCP/IP Network

387

I/OThroughput(MB/s)

60 50 40 30 20 10 0 64KB

128KB

256KB

512KB

Segmentsizeforprocesses aggressivestriping

SSG

Fig. 10. Experiment results in scenario II. There are three processes involved in this test. Each process reads 1GB data with 128KB, 256KB, and 512KB segment sizes, respectively. Background process keeps reading data during the test with 64KB segment size.

6

Conclusions and Future Work

In this paper, we have shown that the conventional method for conﬁguring the storage cluster can achieve the highest access parallelism but causes high interferences on disks of storage servers and more serious data access contention on popular ﬁles, and aggravated Incast eﬀect. To make a trade-oﬀ between potentially high throughout due to access parallelism and excessive interference due to high parallelism, we proposed the storage server grouping (SSG) scheme over parallel ﬁle system. SSG uses admission control to carefully select the striping parameters for a ﬁle by predicting the I/O speedup of a workload and analyzing the harmoniousness of a ﬁle with the one coexisted in a group by tracking the ﬁle popularity. We have implemented a SSG prototype in the Lustre ﬁle system. Experimental results have shown that our technique is able to eﬀectively optimize the striping of data in a cluster-based storage system. The performance evaluation on typical I/O benchmarks, such as mpi-io-test and Iozone, shows that SSG can improve the system throughput by up to 38.6% and 22.1% on average, demonstrating that SSG can eﬀectively alleviate the Incast eﬀect and reduce the interference of data accesses to popular ﬁles. In the future, we would like to further improve adaptability and usability of the SSG scheme. For example, if SSG detects that the popularity of a ﬁle has changed substantially, it should be able to reconﬁgure the striping pattern for the ﬁle on line to reduce the potential interference.

Acknowledgments We thank the anonymous reviewers for their constructive comments. This work was supported by US National Science Foundation under grant CAREER CCF 0845711. This work was also partially supported by the National Natural Science Foundation of China (No. 60775037 and No.60833004) and Chinese Fundamental Research Funds for the Central Universities.

388

X. Zhang, G. Liu, and S. Jiang

References 1. Phanishayee, A., Krevat, E., Vasudevan, V., Andersen, D., Ganger, G., Gibson, G., Seshan, S.: Measurement and Analysis of TCP Throughput Collapse in Clusterbased Storage Systems. In: Fast 2008 (2008) 2. Abd-El-Malek, M., II, W.V.C., Cranor, C., Ganger, G., Hendricks, J., Klosterman, A., Mesnier, M., Prasad, M., Salmon, B., Sambasivan, R., Sinnamohideen, S., Strunk, J., Thereska, E., Wachs, M., Wylie, J.: Usra Minor: versatile clusterbased Storage. In: Fast 2005, San Francisco, CA (December 2005) 3. Sun Microsystem, Inc. Lustre: A Scalable, High Performance File System (2009), http://www.lustre.org 4. Schmuck, F., Haskin, R.: GPFS: A shared-disk ﬁle system for large computing clusters. In: FAST 2002, pp. 231–244 (2002) 5. Laborarory, A.N.: Online-document (2008), http://www.pvfs.org 6. Baek, S., Park, K.: Prefetching with Adaptive Cache Culling for Striped Disk Arrays. In: USENIX 2008 (2008) 7. IOzone Filesystem Benchmark (2009), http://www.iozone.org/ 8. Zhang, X., Jiang, S.: Making Resonance a Common Case: A High-performance Implementation of Collective-I/O on Paralle File System. In: IPDPS 2009 (2009) 9. Nagle, D., Serenyi, D., Matthews, A.: The Panasas ActiveScale Storage ClusterDelivering Scalable High Bandwidth Storage. In: SC 2004 (2004) 10. Iyer, S., Druschel, P.: Anticipatory scheduling: A disk scheduling framework to overcome deceptive idleness in synchronous I/O. In: SOSP 2001 (2001) 11. Mesnier, M., Waches, M., Ganger, G.: Modeling the Relative Fitness of Storage. In: SIGMERICS 2007 (2007) 12. Wang, M., Au, K., Ailamaki, A., Brockwell, A., Faloutsos, C., Ganger, G.: Storage device performance prediction with CART models, Technical Report CMU-PDL04-103, Carnegie Mellon University (2004) 13. WEKA Online document (2008), http://weka.sourceforge.net/doc/weka/classifiers/trees/REPTree.html 14. Bucy, J., Ganger, G.: The Disksim simulation environment version 3.0 reference manual, Technical Report CMU-CS-03-102, Carnegie Mellon University (2003) 15. Ruemmler, C., Wilkes, J.: An introduction to disk drive modeling. IEEE Computer 27(3), 17–28 (1994) 16. Shriver, E., Merchant, A., Wilkes, J.: An analytical behavior model for disk drives with readahead caches and request reordering. In: Preceedings of International Conference on Measurement and Modeling of Computer Systems, pp. 182–191 (1998) 17. Lee, E., Katz, R.: An analytic performance model of disk arrays. In: SIGMETIRCS 2003, pp. 98–109 (2003) 18. Uysal, M., Alvarez, G., Merchant, A.: A modular, analytical throughtput model for modern disk arrays. In: Proceedings of 9th International Symmposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pp. 183–192 (2001) 19. Kosar, T.: A new paradigm in data intensive computing: Stork and the dataaware schedulers. In: Challenges of Large Applications in Distributed Environments (2006) 20. Yu, W., Vetter, J., Canon, R., Jiang, S.: Exploiting Lustre File Joining for Eﬀective Collective IO. In: CCGrid 2007 (2007)

Storage Cluster Interconnected with a TCP/IP Network

389

21. Krevat, E., Vasudevan, V., Phanishayee, A., Andersen, D., Ganger, G., Gibson, G., Seshan, S.: On Aplication-level Approaches to Avoiding TCP Throughput Collapse in Cluster-based Storage Systems. In: Proceedings of Petascale Data Storage Workshop, Supercomputing (2007) 22. D-Link Systems, Inc., Product/Performance Speciﬁcations (2008), http://www.dlink.com/products/resource.asp 23. Parallel I/O Benchmarking Consortium (2008), http://www-unix.mcs.anl.gov/pio-benchmark/ 24. Butler, J.: Panasas Inc., Personal Communication (August 2007)

Evaluate the Performance and Scalability of Image Deployment in Virtual Data Center Kejiang Ye, Xiaohong Jiang, Qinming He, Xing Li, and Jianhai Chen College of Computer Science, Zhejiang University, Zheda Rd. 38, Hangzhou 310027, China {yekejiang,jiangxh,hqm,lix,chenjh919}@zju.edu.cn

Abstract. Virtualization technology plays an important role in modern data center, as it creates an opportunity to improve resource utilization, reduce energy costs, and ease server management. However, virtual machine deployment issues arise when allocating virtual machines into single or multiple physical servers. In this paper, we explore the performance and scalability issues for virtual machine deployment in a virtualized data center. We ﬁrst evaluate the image scalability when allocating multiple VMs per physical server using four typical servers in data center. Then we investigate how the overall eﬃciency will be aﬀected when deploying M virtual machines into N physical machines with diﬀerent deployment strategies. Experimental results show that: (i) There is a resource bottleneck when deploying single type virtual machine server into single physical server, except for composite workloads. (ii) More physical machines do not always beneﬁt for some speciﬁc applications to support a ﬁxed number of virtual machines. (iii) MPI and network communication overheads aﬀect the deployment eﬃciency seriously. Keywords: Virtual Machine, Scalability, Image Deployment, Server Consolidation.

1

Introduction

Virtualization as a critical technology in modern data center has been receiving much attention as it creates an approach to improve resource utilization, reduce costs, and ease server management. The emergence of mature virtualization solutions such as VMware [1], Xen [2], KVM [3], and OpenVZ [4] have spurred this growth. Virtualization provides an abstraction of hardware resources enabling multiple instantiations of operating systems (OS) to run simultaneously on a

This work is funded by the National 973 Basic Research Program of China under grant NO.2007CB310900 and National Natural Science Foundation of China under grant NO. 60970125.

C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 390–401, 2010. c IFIP International Federation for Information Processing 2010

Evaluate the Performance and Scalability of Image Deployment

391

single physical machine. By sharing the underlying physical resources, virtualization could achieve signiﬁcantly higher system utilization compared to traditional physical machines. One of the challenges in realizing such resource sharing is how to eﬃciently deploy virtual machines into physical server computers. Speciﬁcally, two questions should be answered in the deployment process: (i) How many virtual machines can be allocated into one physical machine while still maintaining good performance? (ii) If M virtual machines are required to be deployed, how many physical machines are needed to support them? Since the introduction of virtual machine monitor(VMM), the performance overheads introduced by virtualization is widely evaluated. Many researchers put their focus on the performance overheads of several pivotal system components in virtualization environments, e.g., CPU, memory, disk and network etc [5,6]. Some researchers focus the performance overheads in speciﬁc scenarios, such as server consolidation [7,8] and HPC environment [9]. However, to the best of our knowledge, none of them have quantiﬁed the deployment eﬃciency for typical server applications and HPC applications under diﬀerent deployment strategies. To evaluate the performance of diﬀerent deployment strategies, we design two sets of experiments. We ﬁrst investigate the deployment eﬃciency of M-1 scenario that allocating M virtual machines into one physical machine. We will investigate the best M selection. We choose four representative server applications in modern data center and quantify the server scalability. Then we evaluate the deployment eﬃciency of M-N scenario that allocating M virtual machines into N physical machines. To quantify the overall performance, we choose a standard HPC benchmark HPCC that running in parallel manner with MPI communication. Our experimental results indicate that there is a resource bottleneck when deploying single type virtual machine server into single physical machine and consolidation of diﬀerent workloads together is recommended. It is suitable to deploy several virtual machines running composite workloads into one physical machine. For computing intensive applications that don’t run in parallel manner and processor communication intensive applications which are insensitive with the deployment strategies, adding more physical machines is useless. For the parallel applications with no MPI communication, more physical machines can achieve better performance. While for the computing intensive workloads that have high MPI and network communication traﬃc, it is not suitable to deploy the virtual machines across multiple physical machines due to the signiﬁcant communication overheads between physical machines. The rest of this paper is organized as follows. We describe the background of Xen virtualization technology and our experiment motivation in Section 2. And then in Section 3, we present our experimental methodology to solve the problems presented in Section 2. In Section 4, we discuss and analyze the experimental results. The related work is introduced in Section 5 and we conclude with the future work in Section 6.

392

K. Ye et al.

VM1

VM2

VM3

…

VMM

VMM(Xen/KVM/VMware)

PhysicalMachine (a)MͲ1

VM1

VM2

VM3

…

VMM

VMM(Xen/KVM/VMware) PM1

PM2

PM3

…

PMN

(b)MͲN

Fig. 1. Two experimental scenarios: (a) M-1 means allocating M virtual machines into one physical machine; (b) M-N means allocating M virtual machines into N physical machines

2

Background and Motivation

Previous studies have shown that the resource utilization is very low in data center most of the time [10]. We also realize that the maximum resource utilization is much higher than the average one. Virtualization can be used to solve the utilization problem as it holds the beneﬁts of running multiple virtual machine instances simultaneously while sharing the underlying physical resources. However, there is a lack of proposal for selecting the image deployment strategies. It is essential to investigate the most appropriate deployment strategy to achieve best resource utilization while maintaining good QoS. If we allocate too many or too few virtual machines into the physical machines, the physical machines may either be grossly under-utilized or overloaded leading poor application-level QoS. Fig. 1 indicates our two experimental scenarios. In the M-1 scenario, we measure the scalability that adding the virtual machine into the physical machine one by one to investigate the appropriate virtual machine number hosting in one physical machine. While in the M-N scenario, we will evaluate the eﬃciency that allocating M virtual machines into N physical machines and investigate the best N. 2.1

Xen Virtual Machine Monitor

Xen [2] is a popular open-source x86 virtual machine monitor (VMM) that allows multiple instantiation operating systems (OSs) running concurrently on a single physical machine. Xen supports both full virtualization and para-virtualization scheme. In the para-virtualization scheme, the guests run a modiﬁed operating system using a special hypercall. Para-virtualization avoids frequent traps into the hypervisor, and achieves high performance. Starting from version 3.0, Xen implement the full virtulization through hardware-assisted virtualization (HVM)

Evaluate the Performance and Scalability of Image Deployment

393

technology. Both Intel and AMD have developed their own solutions named Intel VT and AMD-V to support Xen. Xen uses this hardware-assisted technology to support unmodiﬁed guest operating systems to run within virtual machines. 2.2

Requirements for M-1 Performance Evaluation

Generally, one physical machine will support several virtual machines running simultaneously to share the same physical resources in modern data center. However deploying too many virtual machines will cause the degradation of application QoS. To deploy appropriate number of virtual machines into the physical platform is a challenge. Besides, diﬀerent workloads have diﬀerent demands on system resources. For example, web server consumes a lot of network bandwidth and CPU resource while ﬁle server consumes large amounts of disk I/O resource but less CPU time. It is necessary to study the representative server workloads used in data center. 2.3

Requirements for M-N Performance Evaluation

The beneﬁts of virtualization such as ﬂexible resource management, high reliability, live migration play an important role in the high performance computing (HPC) and parallel computing community. However there is a virtual machine deployment issue that arranging appropriate physical machines to support above virtual machine workloads, which is a challenge. Arranging all the virtual machines into one physical machine will reduce the communication overhead between the virtual machines but there is a resource bottleneck which will seriously aﬀect the application performance. In the otherwise, M virtual machines arranged into M physical machines can obtain the most abundant system resources that each physical machine hosts only one virtual machine, but the communication overheads between physical machines are very huge. What’s more, there is a huge loss of electricity energy. For example, if one parallel job needs 16 threads running simultaneously, we will create 16 virtual machines to support it with MPI communication between each virtual machine. We need to ﬁnd a optimal deployment that maximizes the use of system resource while maintaining overall good performance. Is one physical machine enough or more physical machines are needed to support the 16 virtual machines, we will investigate this deployment issue below.

3 3.1

Experimental Methodology Experimental Configuration

The M-1 experimental evaluations are performed on the Dell 2900 PowerEdge server, with 2 Quad-core 64-bit Xeon processors at 1.86 GHz, while M-N experimental evaluations are performed on four Dell OPTIPLEX 755, with Intel Core2 Quad CPU at 2.4GHz. We use Ubuntu 8.10 with kernel version 2.6.27 in

394

K. Ye et al.

domain 0, and the version of Xen hypervisor is the 3.3.1. Each virtual machine is installed with Ubuntu 8.10 as the guest OS with 4 VCPUs and 256MB memory size. We choose MPICH 2.1.0.8 as our MPI environment. 3.2

Workloads and Benchmarks

To investigate the deployment eﬃciency in M-1 and M-N scenarios, we design two sets of experiments. In the M-1 scenario, we measure the scalability that answers how many virtual machines can be allocated into one physical machine while maintaining good performance? We choose four typical server workloads in modern data center running in virtual machine in our experiments. They are java server, ﬁle server, database server, and web server. Java performance is important in many modern multitiered applications. We choose SPECjbb2005 [11] as the benchmark for java server evaluation. File server performance is also a typical metric indicating the ability to serve I/O requests from clients. We use IOzone benchmark [12] to measure the ﬁle system I/O performance. Database is needed to support running transactional workloads in many modern applications, and it is resource intensive and exercises most system components, especially CPU and storage. We choose Sysbench [13] as our database benchmark. The back-end database used in the experiment is Mysql. Web server is pervasive in modern data centers and is a representative workload for consolidation. We use the WebBench, a simple tool for benchmarking WWW or proxy servers. Apache is used as the back-end server. In the M-N scenario, we try to answer the question how many physical machines are needed to support M virtual machines? We use the HPCC benchmark suite [14] for our study that is commonly used for HPC measurements. The HPCC suite is a comprehensive set of synthetic benchmarks designed to proﬁle the performance of several aspects of a cluster. The testing applications used in our study are listed as follows: – HPL: measures the ﬂoating point rate of execution for solving a linear system of equations. – DGEMM: measures the ﬂoating point rate of execution of double precision real matrix-matrix multiplication. – FFT: measures the ﬂoating point rate of execution of double precision complex one-dimensional Discreate Fourier Transform (DDF). – PTRANS: measures the transfer rate for large arrays of data from multiprocessor’s memory. – STREAM: measures the sustainable memory bandwidth (in GB/s) and the corresponding computing rate for simple vector kernel. – RandomAccess: measures the rate of integer random updates of memory (GUPS). – Latency & Bandwidth: measures latency and bandwidth of a number of simultaneous communication patterns. There are three running modes: single means that a single processor runs the benchmark, star means all the processors run separate independent copies of the

Evaluate the Performance and Scalability of Image Deployment

395

benchmark with no communication, mpi means all processing elements running the benchmark in parallel using explicit data communications. In our experiments, three problem sizes were evaluated, they are 1000MB, 2000MB, 3000MB. The block and grid sizes used are common-block: 80, 100, 120; grid: 2*2, 1*4, 4*1.

4

Experimental Results and Analysis

In this section, we describe the measurement results of deployment performance and scalability in M-1 and M-N scenarios respectively. In order to ensure the data precision, each of the showed experiment results was obtained via running benchmarks ﬁve times on the same conﬁguration, the highest and lowest values for each test were discarded, and the remaining three values were averaged. 4.1

Deployment in M-1 Scenario

In this experiment, we investigate the deployment performance that allocating M virtual machines into one physical machine. M scales from 1 to 8 here. Fig. 2 shows the deployment results of java server, ﬁle server, database server, and web server. It is obvious that more virtual machines deployed into one physical machine incurs more performance degradation, especially for ﬁle server and web server, since they are I/O intensive workloads and the serious I/O resource contention becomes a bottleneck. Fig. 2(a) shows the deployment eﬃciency of java server with a 65.95% performance degradation when deploying 8 VMs into the physical machine compared to the case that deploying 1 VM into the physical machine. There is a nearly linear decline in the performance. From Fig. 2(b), we ﬁnd that there is a signiﬁcant decline in ﬁle I/O performance. It is because that the ﬁle server is I/O intensive. When more virtual machines are added into the physical machine, the disk I/O resource becomes the performance bottleneck and seriously aﬀects the performance. We note that when deploying more than 5 virtual machines into the physical machine, the decline trend becomes ﬂat and maintains at a stable I/O bandwidth with about 30000 Kb/s. The reason is that when more than 5 ﬁle servers are deployed into the physical machine, not only the I/O bandwidth but also the other system resources like CPU become the performance bottleneck. When in the database scalability testing (Fig. 2(c)), the physical machine can support 6 virtual machines well with the execution time less than 86 seconds. However, when more virtual machines are deployed into the physical machine there is a huge increase in the execution time due to the resource bottleneck like CPU, memory, and disk I/O. The web server testing (Fig. 2(d)) shows the similar phenomenon with ﬁle server because of the similar system resource demands. The signiﬁcant decline of web server performance is caused by intense competition for the network I/O and CPU resource.

396

K. Ye et al.

Fig. 2. The Deployment Performance and Scalability of Java Server, File Server, Database Server and Web Server in M-1 Scenario that Allocating M Virtual Machines into One Physical Machine (M-1 Scenario)

It is obvious from the above experiments, when the virtual machine number scales from 1 to 8, all the four server workloads have a sharp performance degradation due to heavy demand on the same kind resource which becomes the performance bottleneck. When more virtual machines are added, other system resources will become the performance bottleneck too. So in this M-1 scenario, M depends on user’s QoS requirements. It is not suitable to deploy more than two same servers in the physical machine due to the QoS consideration. One alternative approach is to consolidate diﬀerent workloads running simultaneously. What’s more, we ﬁnd an interesting phenomenon that one physical machine can host 6 database servers running simultaneously well based on our experimental results. It is because that database is a kind of composite workloads which consumes CPU, memory and I/O resource at the same time. 4.2

Deployment in M-N Scenario

In this experiment, we will study the eﬃciency of deploying M virtual machines into N physical machines with HPCC benchmark. We create 16 virtual machines

Evaluate the Performance and Scalability of Image Deployment

397

Fig. 3. The Computing Performance in the Case that Deploying M Virtual Machines into N Physical Machines (M-N Scenario)

Fig. 4. The Memory Performance in the Case that Deploying M Virtual Machines into N Physical Machines (M-N Scenario)

Fig. 5. The Communication Performance in the Case that Deploying M Virtual Machines into N Physical Machines (M-N Scenario)

398

K. Ye et al.

and test three diﬀerent deployments: 1) allocate 16 VMs into one physical machine; 2) allocate 16 VMs into two physical machines with 8 VMs per physical machine; 3) allocate 16 VMs into four physical machines with 4 VMs per physical machine. We investigate the deployment eﬃciency from computing performance, memory performance and communication performance with diverse benchmarks. The results of these experiments are presented in Fig. 3 to 5. Fig. 3 shows the computing performance with diﬀerent deployment strategies in M-N scenario. HPL, DGEMM, and FFT measure the ﬂoating computing capability focusing diﬀerent aspects. From Fig. 3 we note that the HPL performance has no apparent changes when 16 virtual machines are deployed into one, two or four physical machines. Obviously, this phenomenon indicates that more physical machines will not always improve the overall performance for some applications. In the single mode testing, DGEMM and FFT both have no signiﬁcant changes in all the three deployment strategies. While in the star mode, the case that involving more physical machines achieves better performance. It is because in the single mode a single processor is used to run the benchmark, while in the star mode all the processors run separate independent copies of the benchmark with no communication which maximum utilize the physical resource. Furthermore, in the mpi mode testing of FFT, we observe the case that deploying 16 virtual machines into 4 physical machines obtains worse performance than the case that deploying 16 virtual machines into only one physical machine. The reason is that when involving more physical machines, the MPI communication overhead becomes a bottleneck and aﬀects the performance. Fig. 4 shows the result of memory performance with diﬀerent deployment strategies using STREAM and RandomAccess benchmarks. Similar to the computing performance, in the single mode testing, both STREAM and RandomAccess have no signiﬁcant changes in the three deployment cases. While in the star mode testing, the copy, scale, add, and triad performance of RandomAccess gain an performance improvement when more physical machines are used. RandomAccess has the similar result with STREAM in the star mode. While in the mpi mode, we ﬁnd an opposite conclusion with FFT that the RandomAccess obtains better performance when using more physical machines. It is because that the RandomAccess performs much local processor communication that improves the performance while causes little network communication overhead. Fig. 5 shows the communication performance including processor communication and network communication. PTRANS measures the communication performance of multiprocessor while Latency and Bandwidth refer to the network communication performance. From Fig. 5 we ﬁnd that the data transfer rate is slightly aﬀected by the deployment strategies due to the slightly local memory transfer overheads. While network latency and bandwidth are seriously aﬀected by the deployment strategies with a 275.74% increase in latency and a 46.89% decrease in the bandwidth compared to the case that deploying 16 VMs into one physical machine. This experiment demonstrates that there exists huge network communication overhead when deploying 16 virtual machines into more physical machines.

Evaluate the Performance and Scalability of Image Deployment

399

There is a compromise between communication overhead and system resources. For some speciﬁc server workloads such as HPL and PTRANS, more physical machines provide no obvious performance improvement. Generally, more computing resource will improve the running eﬃciency for the parallel workload running in parallel without MPI communication. Furthermore, for the communication intensive workloads that perform much MPI communication or network communication, it is better to deploy the virtual machines into fewer physical machines due to the heavy communication overheads between physical machines.

5

Related Work

A lot of research has been done to evaluate the performance in virtualization environments [15,6,2,7,5,16]. Barham et al. [2] gave a comprehensive introduction to the Xen hypervisor and made a thorough performance evaluation of Xen against VMware ESX Server, UML, Connectix’s Virtual PC and Plex86 with SPEC CPU2000, OSDB, dbench and SPEC Web2005. Clark et al. [15] reproduced the results from [2] with almost identical hardware, and compared Xen with native Linux on a less powerful PC. Padala et al. [7] measured the performance of server consolidation. Che et al. [6] studied an initial comparison of Xen and KVM with Linpack, LMbench and IOzone. Recently, Deshane [5] presented an independent work about performance comparison of Xen and KVM at Xen Summit, which measured the overall performance, performance isolation and scalability of Xen and KVM. Hai Jin et al. [16] presented a VSCBenchmark to evaluate the dynamic performance of virtualization systems in server consolidation. However, all of the above work does not consider the deployment issues. With respect to the resource and image management issues in virtualization environments, Garbacki et al. [17] solved the virtual resource management issue by introducing linear programming approach. Yamasaki et al. [18] presented a model-based resource selection policy to achieve fast virtual cluster installation on heterogeneous grid environment. Tan et al. [19] indicated that the choice of storage solution and access protocol would aﬀect the deployment eﬃciency. Some management system and tools were also developed for ﬂexible deployment and management of virtual machines [10,20,21]. Our work diﬀer from the previous work, focusing the evaluation the performance and scalability of image deployment with diﬀerent deployment strategies.

6

Conclusion and Future Work

In this paper, we have investigated the performance and scalability with diﬀerent virtual machine deployment strategies. We summarize two typical virtual machine deployment scenarios M-1 and M-N to explore the deployment issue. We ﬁrst investigate the deployment eﬃciency of M-1 scenario that allocating M virtual machines into one physical machine. We choose four representative server applications in modern data center and quantify the scalability. Then we

400

K. Ye et al.

evaluate the eﬃciency of M-N scenario that allocating M virtual machines into N physical machines. To quantify the overall performance, we choose the standard HPC benchmark HPCC that running in parallel manner. Our experimental results indicate: 1) There is a resource bottleneck when deploying single type virtual machine server into single physical machine and consolidation of diﬀerent workloads together is recommended. It is suitable to deploy several virtual machines running composite workloads into one physical machine, such as database server which consumes CPU, memory and I/O resources at the same time. 2) For computing intensive applications that don’t run in parallel manner and processor communication intensive applications which are insensitive with the deployment strategies, more physical machines are useless. 3) For the parallel applications with no MPI communication, more physical machines can achieve better performance. 4) For the computing intensive workloads that have high MPI and network communication traﬃc, it is not suitable to deploy the virtual machines across multiple physical machines due to the signiﬁcant communication overheads between physical machines. This is only the beginning of performance evaluation of deployment issues. Various other measurement and optimization algorithm will need to be explored in the future.

References 1. Waldspurger, C.A.: Memory Resource Management in VMware Esx Server. SIGOPS Oper. Syst. Rev. 36(SI), 181–194 (2002) 2. Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., Warﬁeld, A.: Xen and the Art of Virtualization. SIGOPS Oper. Syst. Rev. 37(5), 164–177 (2003) 3. Kivity, A., Kamay, Y., Laor, D., Lublin, U., Liguori, A.: kvm: the Linux virtual machine monitor. In: Linux Symposium (2007) 4. OpenVZ: Server Virtualization Open Source Project, http://openvz.org 5. Deshane, T., Shepherd, Z., Matthews, J., Ben-Yehuda, M., Shah, A., Rao, B.: Quantitative comparison of Xen and KVM, Xen Summit, Boston, MA, USA (2008) 6. Che, J., He, Q., Gao, Q., Huang, D.: Performance Measuring and Comparing of Virtual Machine Monitors. In: IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (2008) 7. Padala, P., Zhu, X., Wang, Z., Singhal, S., Shin, K.: Performance evaluation of virtualization technologies for server consolidation. Technical Report, HP Laboratories (2007) 8. Apparao, P., Iyer, R., Zhang, X., Newell, D., Adelmeyer, T.: Characterization & analysis of a server consolidation benchmark. In: Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pp. 21–30 (2008) 9. Tikotekar, A., Vall´ee, G., Naughton, T., Ong, H., Engelmann, C., Scott, S.L.: An analysis of hpc benchmarks in virtual machine environments. In: Proceedings of 3rd Workshop on Virtualization in High-Performance Cluster and Grid Computing, pp. 63–71 (2008)

Evaluate the Performance and Scalability of Image Deployment

401

10. Padala, P., Shin, K., Zhu, X., Uysal, M., Wang, Z., Singhal, S., Merchant, A., Salem, K.: Adaptive control of virtualized resources in utility computing environments. SIGOPS Oper. Syst. Rev. 41(3) (2007) 11. Standard Performance Evaluation Corporation, SPECjbb, http://www.spec.org/jbb2005 12. Iozone Filesystem Benchmark, http://www.iozone.org 13. Sysbench Benchmarking Tool, http://sysbench.sourceforge.net 14. HPC Challenge Benchmark, http://icl.cs.utk.edu/hpcc 15. Clark, B., Deshane, T., Dow, E., Evanchik, S., Finlayson, M., Herne, J., Matthews, J.: Xen and the art of repeated research. In: USENIX Annual Technical Conference, pp. 135–144 (2004) 16. Jin, H., Cao, W., Yuan, P., Xie, X.: VSCBenchmark: benchmark for dynamic server performance of virtualization technology. In: Proceedings of the 1st International Forum on Next-Generation Multicore/Manycore Technologies (2008) 17. Garbacki, P., Naik, V.: Eﬃcient resource virtualization and sharing strategies for heterogeneous Grid environments. In: 10th IFIP/IEEE International Symposium on Integrated Network Management, pp. 40–49 (2007) 18. Yamasaki, S., Maruyama, N., Matsuoka, S.: Model-based resource selection for eﬃcient virtual cluster deployment. In: Proceedings of the 3rd International Workshop on Virtualization Technology in Distributed Computing (2007) 19. Tan, T., Simmonds, R., Arlt, B., Arlitt, M., Walker, B.: Image management in a virtualized data center. ACM SIGMETRICS Performance Evaluation Review 36(2), 4–9 (2008) 20. Vall´ee, G., Naughton, T., Scott, S.: System management software for virtual environments. In: Proceedings of the 4th International Conference on Computing Frontiers, ACM, New York (2007) 21. McNett, M., Gupta, D., Vahdat, A., Voelker, G.: Usher: An extensible framework for managing clusters of virtual machines. In: Proceedings of the 21st Large Installation System Administration Conference (2007)

A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme Yue Zhang1 and Yunxia Pei2 1 Department of Math and Computer Science Center of Network, Henan Police College, Zhengzhou, China 2 Zhengzhou University of light Industry, Zhengzhou China

Abstract. The goal of this work is to enable mobile devices as a resource in the Grid network.We propose an IP-paging scheme based on the mobile grid computing framework to manage idle mobile devices. Within this framework, we discuss several research issues such as location tracking using IP-paging, limiting power using the Grid page cache, and service migration using reserved resources. We also describe a mobile proxy and a prototype for a paging agent. Keywords: Mobile Grid Computing,IP-Paging, Resource Discovery Algorithm.

1 Introduction Grid computing differs from conventional distributed computing by its focus on largescale resource sharing, innovative applications, and high-performance focus[1]. In Grid computing, the term resource management refers to the operations used to control how capabilities provided by Grid resources and services are made available to other entities such as users, applications, or services [2]. A resource manager is one of the most critical components of the grid middleware [4], since it is responsible for resource management that provides resource selection and job scheduling. Therefore, resource discovery, resource selection, and job scheduling have considerable influence on computing performance and are important issues in Grid computing. In the early period of grid computing, most researches have focused on a fixed network [1, 2, 3, 4, 5]. With the improvements on Internet technology, the use of mobile devices is rapidly increasing. Researches in Grid computing [6] tried to combine Grid computing with mobile devices. However, according to properties of mobile environments, the management of mobile devices deserve many careful considerations, such as mobility management, disconnected operation, device heterogeneity, service discovery, resource sharing, and so on. There are two approaches in mobile grid computing to address these problems: proxy-based mobile grid architecture [6] and agent-based mobile grid architecture. There are various solutions proposed in these two kinds of architectures. However, the existing solutions cannot solve the mobility management problems completely. C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 402–411, 2010. © IFIP International Federation for Information Processing 2010

A Resource Discovery Algorithm in Mobile Grid Computing

403

2 Related Works This section briefly surveys the background on mobile grid computing and IP-paging, which is employed in our proposed architecture. 2.1 Mobile Grid Computing There are two architectures for mobile grid computing. One is a proxy-based mobile grid architecture, and the other is an agent-based mobile grid architecture. Proxybased mobile grid architectures to support lots of mobile devices in computational grid are proposed in [6]. Fig. 1. shows the proxy-based mobile grid architecture.

Fig. 1. The proxy-based mobile grid architecture

In this architecture, there is a cluster of mobile devices. The cluster is connected to a base station that is a router/node within grid or a grid proxy server. The base station has two functions. It works as a wireless access point and a proxy server that is responsible for data transfer, QoS, and resource access policy. The mobile devices use the base station to access grid, and the base station allocates the jobs requested from mobile devices to grid computing system. However, in the proxy-based architecture, mobile devices have no access to grid resources. The agent-based grid architectures use mobile agents to provide, share and access distributed resources in grid network. Grid clients submit computing jobs to the computational grid, and the mobile agents carrying the user programs of the jobs travel among the remote hosts to search for available resources. By taking advantages of the mobile agent’s mobility and autonomy features, a job can be executed at any suitable machines and may migrate to other nodes if necessary. The service tasks can migrate to different idle machines during their execution. In this way, load balancing and sharing of distributed computing resources in the computational grid can be achieved.

404

Y. Zhang and Y. Pei

2.2 IP-Paging IP-paging is a procedure that allows a wireless system to search for idle mobile devices when there is a message destined for it, such that the mobile user does not need to register its precise location to the system whenever it moves. Paging has two major benefits: z z

reducing signaling overhead reducing power consumption.

Fig. 2. shows basic paging reply message format in IP-paging.

Fig. 2. IP-paging reply message format

In mobile grid computing environment, since grid tasks are executed at mobile devices, reducing power consumption is important issue to use a mobile device to perform such tasks.

3 Wireless Grid Paging Architecture In this paper, we propose a wireless grid computing architecture based on IP-paging. We tried to enable dormant mobile devices as a grid resource and to solve limited battery problem. Fig. 3. shows the wireless grid paging architecture.

Fig. 3. The wireless grid paging architecture

A Resource Discovery Algorithm in Mobile Grid Computing

405

In this wireless grid computing architecture, the movements of mobile device are managed by a grid proxy or a grid controller. The grid proxy in our proposed architecture performs scheduling operation as a pure meta-scheduler. It performs following three operations: dividing a submitted job into tasks, assigning tasks to mobile proxy, and collecting the accomplished tasks. 3.1 The Paging Agent In previous wireless grid computing architecture, the movement between cells is managed by a fixed grid network. However, our architecture doesn’t consider the movement between cells of the mobile devices. It is because we use only idle mobile devices. Instead, we consider only the status of mobile devices and the movements between paging areas. In IP-paging scheme, the status of a mobile device is classified into two categories: the active state and the idle state. The mobile device is registered to paging agent with following two conditions: joining paging area of new wireless grid and changing from the active state to the idle state.

Fig. 4. The grid paging registration message format

Fig. 4. illustrates the grid paging registration message format. In IP-paging scheme, paging agent manages address information of an idle mobile device. In this paper we extend IP-paging message format to manage address information and resource status information. 3.2 Mobile Proxy When tasks are assigned from a grid proxy, a mobile proxy composes a pre-candidate set by resource information on paging cache. And, to confirm status of resource, the mobile proxy sends paging request message to the mobile devices on the precandidate set. Then, the mobile proxy receives paging reply message from mobile devices on pre-candidate set, and decides final resource set. There is additional information in paging reply message that is the current address of mobile device that will issue the computing task. And, to prevent resource duplication, running mobile devices are marked on paging cache. Fig. 5. shows the grid paging reply message format.

406

Y. Zhang and Y. Pei

Fig. 5. The grid paging reply message format

4 Mobile Proxy and Paging Agent Fig. 6. shows the components of a mobile proxy and the paging agent. The proposed mobile proxy and paging agent consist of grid middleware, ontology service, and UDDI registry.

Fig. 6. The mobile proxy and paging agent

The grid middleware consists of the communication interfaces, a paging cache, an information service, and a broker service.

A Resource Discovery Algorithm in Mobile Grid Computing

407

4.1 The Communication Interface There are two communication interfaces: one is that between the mobile device and the mobile proxy, and the other is between the mobile proxy and the paging agent. The communication interface with mobile devices communicates with mobile devices by SOAP framework. The communication interface with grid network communicates with a fixed grid network and allows using the standard grid protocol (ex. GRAM, MDS, GIS). 4.2 Paging Cache Mobile grid environment intends free communication between mobile devices and a fixed grid network. But it isn’t possible, because there are different types of devices, interfaces, CPUs, capacity in the mobile environment. As shown in the section 3.2, the mobile proxy and paging agent manages mobile device status in its grid paging area. Fig. 7. shows the example of grid paging cache.

Fig. 7. The example of grid paging cache

The Home Address field denotes mobile device’s home address. The Type field denotes type of a mobile device (i.e., L means laptop, P means PDA, C means cellular phone). The CpuSpeed field denotes CPU speed of a mobile device. The NumCpus field denotes the number of CPUs in a mobile device. The RamSize field denotes RAM size in a mobile device. The Power field denotes the remaining power of a mobile device (e.g., 40 means 40%, R means recharged). The Capacity field denotes all capacity of mobile devices. The FreeSpace field denotes usable capacity of mobile devices. And, the Used field denotes a mobile device is using or not in another grid operations (i.e., N means none, U means used). 4.3 Information Service There are lots of grid services and grid resources in grid network. To provide grid services, service discovery and resource provision is needed. The globus toolkit is representative grid infrastructure to provide service discovery and resource provision. The principle of service discovery and the resource provision scheme of globus toolkit is that when a new job or task is assigned, it discovers and decides a suitable service or resource among all reserved services or resources. However, this scheme of globus is incongruent to mobile grid environment, due to the device mobility.

408

Y. Zhang and Y. Pei

The proposed service discovery and resource provision scheme is based on paging cache discussed in the section 4.2. When a new task is assigned, the paging agent composes a pre-candidate set by resource information on paging cache. And, to confirm the status of resource, the mobile proxy sends paging request message to the mobile devices on the pre-candidate set. The service discovery and the resource provision steps follow the grid paging registration/replay procedure discussed in the section 3.3. Table 1 shows the query processing. Table 1. The Query Processing

ex) Q={R{R1 .... RN}| Type = laptop and CpuSpeed 2.0 and RamSize 512 and Power = recharged and ......... } 4.4 Broker Service The broker service in mobile proxy and paging agent performs two operations: The first operation is delivery of an assigned task from mobile device to grid proxy. And, the second operation is to schedule an assigned task from grid proxy to mobile resources. Table2. shows the operation of broker service. Table 2. The Operation Of Broker Service

Broker service phase 1: Wait for tasks from grid proxy 2: Request grid paging cache 3: Decide pre-candidate set from paging cache 4: Send grid paging request to current status of pre-candidate set 5: Wait grid paging reply from pre-candidate set 6: Decide candidate set from pre-candidate set 7: Decide reserved mobile devices from pre-candidate set 8: Send task to each mobile devices of candidate set 9: Request to change used field U and N to paging cache 10: If processing mobile device changes its state to active or its battery status to discharging then 11: Receive processing data 12: Send processing data to reserved mobile device 13: Wait for tasks from each mobile devices

5 Performance Evaluations 5.1 Network Model Fig. 8. shows the network model.

A Resource Discovery Algorithm in Mobile Grid Computing

409

Fig. 8. The network model

The network model used for all the simulations includes six routers and three gateway routers on the Ethernet (10Gbps)-based fixed grid network. Each wireless grid paging area includes 14-18 cells. Mobile devices have a wireless link (11Mbps:802.11b) with the access point on each cell. Access point and gateway router of a fixed grid network are connected to each other through Ethernet-based internet. 5.2 Class Component Fig. 9. shows the class of components. The Grid_Proxy is located on a fixed grid network. And the Mobile_Proxy is located on a wireless grid network. The Paging_Agent is located with Mobile_Proxy. It makes paging cache by each mobile device’s MD_Status. And the Mobile_Device is invoked by MD_manager. 5.3 Performance Metrics The purpose of the performance comparison is to quantitatively evaluate the improvements that total tasks would experience in a system using the proposed enhancements in comparison to the proxy-based scheme. The parameters to be studied are as follows: z

Task execution time: Task execution time is defined for an executing task as the time that elapses between the stating task execution and the finishing task execution. We study task execution time for an increasing number of total tasks.

410

Y. Zhang and Y. Pei

Fig. 9. The class of components

z

Task loss: Task loss is defined for an executing task as the task loss by movement or power discharging. We study task loss for random movements and power discharging on execution time.

5.4 Evaluation and Discussion We present the results of the impact of execution time in Fig. 10 with the following parameters: total tasks, starting task, movement, battery discharging, and task migration. For all tasks, the proposed scheme shows a slight better performance than a proxybased scheme. Because the proposed scheme selects idle mobile devices, selected mobile devices have low probability of movement or active state during execution. We can observe that the proposed scheme is slowly increased and the proxy-based scheme has rough variation, because mobile devices that the proxy based scheme selected, re-execute many times by movement and battery discharging.

Fig. 10. The total execution time on total tasks

A Resource Discovery Algorithm in Mobile Grid Computing

411

6 Conclusion and Future Works In mobile grid computing, there are many problems to be solved, e.g., disconnected operation, location tracking, limited power, service migration, job scheduling, device heterogeneity, security, and so on. In this paper, we tried to solve location tracking using IP-paging, limited power using grid paging cache, and service migration using reserved resource. But there are still challenging problems such as disconnected operation, job scheduling, device heterogeneity, security and so on. We plan to tackle these issues in future works and develop mobile grid computing system based IPpaging. Moreover, we plan to extend our investigation into wireless network and implications in grid computing and try to provide effective mobility support computing.

References 1. Foster, S.I., Kesselman, C., Tuecke, S.: The Anatomy of the Grid: Enabling Scalable Virtual Organizations. Int. Journal of Supercomputer Applications 3 (2001) 2. Foster, I., Kesselman, C.: The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, San Francisco (2008) 3. Foster, I., Roy, A., Sander, V.: A Quality of Service Architecture that Combines Resource Reservation and Application Adaptation. In: 8th International Workshop on Quality of Service (2000) 4. Foster, I.: The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. In: Global Grid Forum (2009) 5. Foster, I.: The Grid: A New Infrastructure of 21st Century Science. Physics Today 55, 42–52 (2002) 6. Ghosh, P., Roy, N., Das, S.K., Basu, K.: A Game Theory based Pricing Strategy for Job Allocation in Mobile Grids. In: Proceedings of 18th International Parallel and Distributed Processing Symposium (2004)

JAMILA: A Usable Batch Job Management System to Coordinate Heterogeneous Clusters and Diverse Applications over Grid or Cloud Infrastructure Juan Peng, Xiaoyi Lu, Boqun Cheng, and Li Zha Institute of Computing Technology, Chinese Academy of Sciences, 100190 Beijing, China {pengjuan,luxiaoyi,cbq}@software.ict.ac.cn, [email protected]

Abstract. Usability is an important feature of Grids or Clouds to end users, who may not be computer professionals but need to use massive machines to compute their jobs. For meeting various computing or management requirements, heterogeneous clusters with diverse Distributed Resource Management Systems (D-RMS) and applications are needed to supply computing services in Grids or Clouds. The heterogeneity of clusters and diversity of applications are easy to cause Grid or Cloud systems hard to use. This paper presents a usable batch job management system, called JAMILA, to coordinate heterogeneous clusters and diverse applications over Grid or Cloud infrastructure. JAMILA aims to implement a uniform job management mechanism to integrate different kinds of DRMSes and it is designed to supply non-professional users from various fields with a professional but usable high performance computing environment. Some key technologies of JAMILA have been used in China National Grid and an experimental Cloud-oriented infrastructure. Keywords: D-RMS, heterogeneous clusters, diverse applications, HCC, ASEM.

1 Introduction As computing resources continue to move into Grids and Clouds, they let more and more users who have a variety of HPC (High Performance Computing) demand gather around Grids or Clouds. In the Grid environment, every site is managed autonomously, so different sites may be installed different D-RMSes and deployed diverse applications. According to our experience in developing Vega GOS [1, 2] for CNGrid (China National Grid), we draw the Fig. 1 to illustrate the typical scenario of heterogeneous computing resources in Grids. In the Fig. 1, Beijing site installs LSF, Shanghai site utilizes Torque, and Hong Kong site uses Condor. As different interfaces among D-RMSes, it causes the heterogeneities among clusters. If users directly use these clusters, they need spend a lot of time accepting unnecessary trainings. Meanwhile, a Grid, as a huge resource pool, should supply HPC services to users of various fields, who need to compute jobs by diverse applications. In the Fig. 1, most C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 412–422, 2010. © IFIP International Federation for Information Processing 2010

JAMILA: A Usable Batch Job Management System

413

applications in all three sites are different, and even the same application in different sites may be installed with different versions, paths, or configurations. These cause the diversity among applications. In CNGrid, 42 kinds of applications are installed up to October in 2009, and many of them have different versions.

Fig. 1. Scenario of Heterogeneous Clusters and Diverse Applications in the Grid Infrastructure

Fig. 2. Scenario of Heterogeneous Clusters and Diverse Applications in an Experimental Cloud Computing Oriented Infrastructure

Therefore, if we want to design a usable batch job management system for Grids, we should shield the heterogeneity of clusters. In addition, it also should make users focus on their job execution logic but the usages of Grid system’s APIs or commands. The job execution logic is the process that users engage in computing activities: 1) application selection; 2) parameter configuration of the application; 3) stage-in and stage-out files configurations; 4) resource quota configuration. It means how

414

J. Peng et al.

many CPUs, memories and other else resources are required. This requires the batch job management system can provide a usable way for users to easily access all applications in the Grid. For Cloud, in our opinions, we consider it has three major differences with the Grid. First and foremost, Cloud is more concerned about the management of resources within a site. There is always an infrastructure management system as the fundamental component of IaaS to support resource dynamic provisioning and partitioning. Each partition is likely to be used to support different computing modes and managed by different software. Besides, virtualization technology emerges as a new resource provisioning way. The most important feature of this way is multi-tenanted which means customers can configure their leased resources on demand within the lease rule. This easily causes the heterogeneous software deployed in virtual clusters. Last but not the least, new computing mode for large-scale data processing should be supported in a Cloud platform. Fig. 2 shows a resource view of our experimental Cloud environment. We anticipate tens of thousands of machines in a datacenter or a computing center will be divided into four big clusters: a MapReduce cluster, an HPC cluster, a VM Provisioning cluster for virtual cluster creation, and a big Storage pool. In this figure, we suppose there are two virtual clusters with Condor and LSF for users’ mutative computing requirements. Similar with the Grid scenario, if we want to design a batch job management system for a Cloud, we also have to confront with the situation of heterogeneous clusters and diverse applications. From all of above, we conclude that no matter in a Grid or a Cloud infrastructure, the heterogeneity of clusters and the diversity of applications are exist, and they are easy to cause the Grid or Cloud hard to use. This paper presents a usable batch job management system, called JAMILA, to coordinate heterogeneous clusters and diverse applications over Grid or Cloud infrastructure. JAMILA aims to implement a uniform job management mechanism to integrate different kinds of D-RMSes and it is designed to supply non-professional users a professional but usable HPC environment. The rest of this paper is organized as follows: Section 2 presents the architecture of JAMILA. Section 3 shows some key technologies implementation of JAMILA. In the section 4, the evaluation of JAMILA will be demonstrated. Section 5 concludes the paper and gives the future work.

2 JAMILA Architecture 2.1 Design Goals and Basic Concepts In our design, JAMILA should satisfy three major goals: 1) Easy to use. JAMILA should make end users just concentrate their attention on their jobs’ execution logic but other things. To administrators, JAMILA should provide tools to help them integrate different clusters and manage other resources; 2) Effectiveness. It mainly embodies in two aspects: a) quick response to requests; b) job processing capacity: it can concurrently handle hundreds of jobs and requests well; 3) Robustness. The system should work well under heavy load, and the system should have recovery mechanism when system failure.

JAMILA: A Usable Batch Job Management System

415

Before we introduce the architecture, four basic concepts in JAMILA are: 1) User. JAMILA has two kinds of users: global user and local user. A global user is managed by JAMILA while a local user is managed by local OS. User mapping exists between them; 2) Application. It is a software package providing some HPC functions to users; 3) Cluster. It is a computing resource. It is often installed D-RMS and applications to provide HPC services. We classify the clusters by the type of D-RMS installed on them; 4) Job. A job can be considered as a runtime instance of an application on a cluster by a global user. A job has a life cycle with different statuses. 2.2 Architecture Overview From the scenarios in Introduction chapter, we find that end users mainly concern submitting jobs and quickly getting jobs’ results. So, we consider designing a batch job management system should solve the following three most important problems: 1) providing a usable interface to make users easily submitting and managing jobs; 2) submitting and executing jobs on a proper cluster; 3) helping administrators integrate clusters. Based on these problems, we design JAMILA to three layers: User Interface Layer, Job Execution Layer, and Cluster Connector Layer, as shown in the Fig. 3.

Fig. 3. Architecture of JAMILA

1) User Interface Layer: it is to make end users directly face to all kinds applications and to shield all things that are not directly related with job execution logic. In this layer, JAMILA provides two web 2.0 technology based portals: one is Management Portal, and the other is Application Portal. Web-based computing services can be provided anytime, anywhere to users from various fields. Administrators can use Management Portal to configure user mapping, to integrate heterogeneous clusters

416

J. Peng et al.

and to manage diverse applications. End users can use Application Portal to submit or to manage jobs. When doing submission, users first select an application, and then define the jobs execution logic on a web page related to the selected application. Finally, they click the submit button, and wait for jobs’ results. Fig. 4 shows the submission page of Gaussian.

Fig. 4. Job Submission Page of Gaussian

Fig. 5. The Submission Processing Sequences and Job Stages Conversion in JAMILA

2) Job Execution Layer: it addresses all requests from end users. It contains three parts: Job Manager, Scheduler, and Job Execution Pipeline. Job Manager likes a task dispatcher which pushes submission, rerunning, terminating requests to Scheduler and handles other simple requests by itself. Scheduler is to find out a target cluster and put the job to Job Execution Pipeline to be executed (If the request is to terminate a job, the Scheduler does nothing but directly pushes this request to Job Execution Pipeline.). Job Execution Pipeline is the real place to manage the execution of a job on a cluster. In this layer, JAMILA adopts GridSAM’s pipeline technology [3], which utilizes SEDA (Staged Event Driven Architecture) [4] as the implementation principle, and builds on Quartz [5] framework to schedule stages and allocate threads. By pipeline technology, a job’s life cycle is divided into several stages. Each stage does a

JAMILA: A Usable Batch Job Management System

417

specific task. Fig. 5 shows the process of submission operation and job stages. If a job gets errors at any stages, the job will exit from pipeline directly after changing its status to fail. Job execution pipeline technology highly improves the system’s throughput. 3) Cluster Connector Layer: it aims to address how to connect heterogeneous clusters in a uniform way. The main component of this layer is HCC (Heterogeneous Cluster Controller), which contains three parts: Task Dispatcher, User Switcher, and Driver. Task Dispatcher delivers jobs to their target clusters to run. Then the User Switcher on the target cluster receives these jobs, maps every global user to a local user on the cluster, and redirects the task on behalf of the local user to the Driver. Finally, the Driver invokes D-RMS on the target cluster to do this task and returns DRMS’s response to Job Execution Pipeline.

3 Key Technologies Implementation in JAMILA As mentioned in Introduction, two difficulties of implementing a batch job management system over a Grid or Cloud are how to make users easily access all kinds of applications, and how to coordinate heterogeneous clusters. JAMILA invents two technologies to solve these two problems, one is ASEM (Application Software Encapsulation Mechanism) and the other is HCC (Heterogeneous Cluster Controller). In addition, an optimization technology is also shared. 3.1 Application Software Encapsulation Mechanism (ASEM) ASEM, which is similar with the Rappture [6, 7] technology in nanoHUB.org [7], provides a uniform solution of how to encapsulate diverse applications with good sharability and usability. In order to better understand the implementation detail of ASEM, we first introduce the concept of ST. ST is a method for encapsulating an application software usage and it contains a vocabulary, variables, functions, and normative XML Schema that facilitate the expression of application software usage. Definition 3.1. ST = < JobSubmitTemplate, JobManageTemplate> JobSubmitTemplate = < SubAppearance, SubOperation>: defining job submission logic. JobManageTemplate = <ManAppearance, ManOperation>: defining job management logic.

Definition 3.2. SubAppearance|ManAppearance = < HTMLContet, {Image}>

、

、

HTMLContet: containing an HTML string to define the application’s web page. Image = < CodingType ImageFilePath ImageCode>: storing the images of web pages.

Definition 3.3. SubOperation = {Name, SoftwareName, SoftwareVersion, JobProject, Description, Executable, Argument, Output, Error, Input, {StageIn}, {StageOut}, Resource, {GenerateFile}, JobWorkDirectory}: it is to store information of job execution.

418

J. Peng et al.

Definition 3.4. ManOperation = < Invoker, OperationSequence > Invoker: component of invoking the management operation OperationSequence = < {GenerateFile},{UploadFile} >: defining management operation.

Definition 3.5. StageIn | StageOut | UploadFile = < Source, Target > Source: file address referenced to the client host. Target: file address referenced to the server host

Image element is to store coding data of the picture with the coding type indicated in CodingType element. ASEM loader and Interpreter has coding and decoding functions. So, when an administrator creates an ST, he or she configures other text parts of the ST and uploads pictures, then the ASEM loader and Interpreter codes the pictures and adds several Image elements to the ST. Next time, when he or she exports this ST, pictures’ code is in it and it works well in other environments. This makes STs easy and convenient to be shared. All STs in JAMILA are a kind of resource. ASEM Loader finds out this resource and shows them in Application Portal in the form of a list. When the end user chooses an application from the list, the ASEM Interpreter paints the web page according to the HTMLContent element of the selected application’s ST, like the page shown in Fig. 4. So, by ST and ASEM Loader and Interpreter, end users can submit jobs on a web page instead of developing a complicate JSDL [8] file or using D-RMS commands directly. As users are quite clear about their job execution logic, filling such a web page form is not difficult. Since applications installed on different clusters may have different versions, paths, and other properties which are closely dependent on specific environments. To better share the developed STs across Grids or Clouds, we must remove application’s variability from ST. So we introduce STI, which is used to store the applications’ variable information such as version, command paths, and so on. This design makes STs be independent with environment, so STs can be used to other environment besides the developed environment. And the Management Portal contains exporting ST function. The last component is ASEM Translator which is to transfer end users’ inputs, ST, and relative STI together to a JSDL. Then it pushes the generated JSDL to Job Execution Layer. For most batch job management systems supporting JSDL, so in order to be easy to interoperate with other systems, Job Execution Layer uses the JSDL as its input. 3.2 Heterogeneous Cluster Controllers (HCC) HCCs are bridges between JAMILA and clusters. It contains three parts: Task Dispatcher, User Switcher, and Driver. All these three parts are implemented by Linux Shell Scripts. The Driver provides six standard interfaces for all kinds of D-RMSes as shown in the Table 1. Operations related with jobs in JAMILA are implemented by these six basic interfaces. All kinds of Drivers for specific D-RMSes need to be implemented these six standard interfaces. Therefore, the above layer can use a uniform way to invoke HCCs.

JAMILA: A Usable Batch Job Management System

419

HCC separates implementation from logic of the system which makes upper layer not require caring about the underlying resources and makes JAMILA easily integrate different kinds of clusters. Furthermore, by using HCC, heterogeneous clusters can dynamically join to or leave from JAMILA. Table 1. Interfaces of Driver Parameter Format

Function

-b -s -d

Submit a job to a cluster’s D-RMS Get a job’s latest status Get a job’s execution detail, including execution queue, consumed CPU time and so on Cancel a specific job Rerunning a specific job Doing pre-process or post-process of a specific job.

-c -r -p

3.3 Job Submission Rate Controlling In the Job Execution Layer of JAMILA, we design a simple but efficient mechanism to improve the system’s capacity. In this layer, job execution pipeline uses Quartz to schedule jobs stages. When lots of jobs are submitted into the pipeline, it very easily causes congestion in Quartz, which makes many jobs hard to be scheduled. If more and more jobs continually rush into the system, it will make all threads in the system look like locked. Therefore, we design and implement a simple job submission rate controlling mechanism to make the job’s arrival rate to be nearly stable. If jobs cannot be served at current, they will be saved into database. Through our experiments in the next section, we can see the processing capacity and throughput of JAMILA are highly improved by it.

4 Evaluations 4.1 Evaluations of Design Goals Easy to use: For an administrator, he or she can integrate new clusters or old clusters with new configurations to JAMILA by HCC. He or she just needs to do three simple steps: copy User Switcher and a fit Driver to the cluster, change the commands path of D-RMS in the Driver, and configure sudo & SSH of the cluster to enable JAMILA server executing the Driver with any local users’ identities without password. Now, JAMILA has Drivers for most popular D-RMSes including PBS, Torque, Condor, LSF, SGE and Hadoop. What is more, developing a new Driver is not difficult by our Driver Template. JAMILA aims to supply non-professional users a professional but friendly HPC environment. In JAMILA, users do not need to learn extra computer knowledge. What they need to do when submitting a job is to define their jobs execution logic mentioned in Introduction chapter, and then push the jobs to JAMILA. In a word, JAMILA makes job submission and management operations like surfing internet, which can be finished just by a few mouse clicks and filling a web page form.

420

J. Peng et al.

Effectiveness: We do some experiments to evaluate JAMILA’s performance mainly on its max processing capacity and response time. All experiments are done on the server with following configuration: x86_64, two Intel(R) Xeon(R) with each four cores, 2.0GHz, 8GB memory, Gigabytes Ethernet, CentOS 5.3 (kernel 2.6.18), MySQL 5.0.45, JDK 1.6.0, Apache Tomcat 5.0.28, Torque 2.1.9. Firstly, we evaluate the response time for job submission operation from 10 jobs/min to 400 jobs/min. As shown in the Fig. 6, the response time keeps stable nearly before 100 jobs/min, and the average response time is 316ms from 10 jobs/min to 400 jobs/min. Then, we evaluate the average turnaround time evaluated by the number of finished jobs per minute, and the result is shown in the Fig. 7. From this result, we can see that when the job submission rate reaches 300 jobs/min, if no submission rate controlling mechanism, the finished job number per minute will reduce dramatically; if we adopt job submission rate controlling, the finished job number keep stable basically. Robustness: From the Fig. 7, we can see that Job Submission Rate Controlling makes JAMILA work well on a big submission rate. Moreover, as JAMILA having functions like job information persistence and recovery mechanism by Quartz, when the system fails, those unprocessed or unfinished jobs can continue to run. 4.2 Evaluations of Key Technologies JAMILA’s two major technologies ASEM and HCC are widely used in CNGrid for about two years. Up to October in 2009, there are 42 kinds of applications encapsulated by ST in CNGrid. These applications cover many fields, such as Biology, Meteorology, Chemistry, and so on. Closely analyzing these STs, we find that 62.07% STs can be developed from a simple ST, like Gaussian ST shown in Fig. 4. According to our training experience of developing STs, a simple ST like above can be developed in 10 minutes. Even the most complicated STs in CNGrid can be developed less than 2 hours. What’s more, STs can be shared by different clusters since it is independent with environments. So, developing a ST is not a difficult thing. HCC technology is used to shield the heterogeneities of clusters in Grid or Cloud infrastructures. Developers of GridSAM also introduced this technology into GridSAM by our collaboration. Firstly, we give a simple statistic result of drivers’ code lines in HCC as shown in Table 2. From this table, we can see all six Drivers of HCC are not very complex and the average code of all Drivers is 323 lines. Here, JAMILA can also easily integrate Hadoop cluster for large scale data processing in our experimental cloud environment through ASEM and HCC technologies. Besides, we do experiments to show the cost of driver execution in the job submission operation and the cost of driver invocation by VFork. Both of two costs have little change as the number of continually submitted jobs increasing and the average driver invocation cost is only 6.7% to that of driver execution. We adopt JNI to invoke more light-weighted VFork mechanism to fast create a child process, instead of Fork. We find VFork can dramatically reduce the memory usage.

JAMILA: A Usable Batch Job Management System

421

Fig. 6. Average Response Time under Different Job Submission Rate

Fig. 7. Average Finished Jobs/Min under Different Job Submission Rate Table 2. Statistic Results of Drivers’ Code Lines in HCC Driver Type PBS LSF Condor Fork SGE Hadoop

Code Lines of Driver 405 323 378 138 412 341

5 Conclusion and Future Work This paper presents a batch job management system JAMILA, to coordinate diverse applications and heterogeneous clusters over Grids or Clouds. In JAMILA, ASEM is to encapsulate applications with good sharability and usability. With it, users just need to concern jobs execution logic. ASEM has been used in CNGrid practically for two years over 12 sites across China and up to October in 2009, 42 kinds of applications

422

J. Peng et al.

are encapsulated by ST. HCC is used to integrate heterogeneous clusters in a uniform way. In order to integrate a new kind of cluster, administrators just need to develop a fit Driver for the new cluster. Now, JAMILA has Drivers for Fork, PBS, LSF, Condor, SGE, and Hadoop. Besides these two important technologies, we do some work on improving JAMILA’s effectiveness and robustness, such as job submission rate controlling, memory leak checking and so on. These efforts largely improve the system’s processing capacity and reliability. JAMILA with 1GB memory usage configuration can handle about 230 million jobs’ whole life cycle running at the same time. There is a lot of future work to do. The most important is JAMILA needs a good scheduling algorithm to balance loads among clusters. Currently, scheduler component in JAMILA uses random scheduling algorithm to get a target cluster.

Acknowledgment This research is supported in part by the National Basic Research (973) Program of China (Grant No. 2005CB321807) and the Hi-Tech Research and Development (863) Program of China (Grant No. 2006AA01A106, 2006AA01Z121, and 2009 AA01A130).

References 1. Xu, Z., Li, W., Zha, L., Yu, H., Liu, D.: Vega: A Computer Systems Approach to Grid Computing. Journal of Grid Computing 2(2), 109–120 (2004) 2. Zha, L., Li, W., Yu, H., Xie, X., Xiao, N., Xu, Z.: System Software for China National Grid. In: Jin, H., Reed, D., Jiang, W. (eds.) NPC 2005. LNCS, vol. 3779, pp. 14–21. Springer, Heidelberg (2005) 3. Lee, W., McGough, A.S., Darlington, J.: Performance Evaluation of the GridSAM Job Submission and Monitoring System. UK eScience Program All Hands Meeting (2005) 4. Welsh, M., Culler, D., Brewer, E.: SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. In: Proceedings of the 18th ACM Symposium on Operating System Principles, pp. 230–243. ACM Press, Banff (2001) 5. Cavaness, C.: Quartz Job Scheduling Framework: Building Open Source Enterprise Applications. Prentice Hall PTR, Upper Saddle River (2006) 6. Rappture, https://nanohub.org/infrastructure/rappture/ 7. Klimeck, G., McLennan, M., Brophy, S.P., Adams III, G.B., Lundstrom, M.S.: NanoHUB.org: Advancing Education and Research in Nanotechnology. Computing in Science and Engineering 10, 17–23 (2008) 8. Anjomshoaa, A., Brisard, F., Drescher, M., Fellows, D., Ly, A., McGough, S., Pulsipher, D., Savva, A.: Job Submission Description Language (JSDL) Specification V1.0. Open Grid Forum, GFD.136 (2008)

User-Centric Privacy Preservation in Data-Sharing Applications Feng Gao1, Jingsha He2, and Shufen Peng1 1

College of Computer Science and Technology, Beijing University of Technology Beijing China 2 School of Software Engineering, Beijing University of Technology Beijing China [email protected], [email protected]

Abstract. During data sharing process across people, users do not want the information that contains privacy to be shared with everyone else; some people may only want to share it with someone they are familiar with. To solve this issue of privacy preservation in data-sharing applications, we propose a novel user-centric method. Our main contributions include the followings. (1) Users can select key words or characters for their own privacy information, which is a user-centric way to protect privacy. (2) During the process of data sharing, data substitution can be used to ensure privacy preservation as well as high successful rate of data sharing. (3) Our method can be used in different data-sharing applications in a flexible way. Simulation results show that our method can achieve our privacy preservation goal. Keywords: User-centric, privacy preservation, data sharing, trust, P2P network, online social network.

1 Introduction The evolution of peer-to-peer (P2P) networks has triggered large scale distributed applications. The main application domain is data sharing across a very large number of highly autonomous participants. The popular scenarios of data sharing in the peer-to-peer (P2P) networks focus on massive file sharing. Advanced scenarios such as online communities (e.g., medical communities) need to share private or sensitive data frequently. The other typical data-sharing application is online social networks, which is immensely popular, claiming over 200 million users. In online social networks, the information that users share may include privacy, such as personal information or private photographs that should not be misused [2]. Privacy is a concept combining law, sociology and psychology. The dimension of privacy refers to user’s culture, education level, preference and so on. Therefore, the definition, the sensitive degree and the range of privacy information is diversiform. C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 423–431, 2010. © IFIP International Federation for Information Processing 2010

424

F. Gao, J. He, and S. Peng

According to the multiplicity of privacy understandings, it should be important to research the privacy preservation mechanism that is user-centric way. To solve the issue of privacy preservation in data-sharing applications, we propose a novel user-centric method in this paper that includes following contributions. Firstly, user can select key words or characters by themselves for their own privacy information which means a user-centric way to protect privacy. Secondly, if the data under sharing contains too much privacy according to user’s privacy policy and trust model, data substitution can be used to ensure privacy preservation. The substitution also brings higher successful rate of data sharing. Thirdly, our method can be applied in different data-sharing applications in a flexible way. The rest of this paper is organized as follows. In section 2, we introduce some related work. In section 3, we explain our user-centric privacy preservation method that includes formally description and the execution model. In section 4, we perform some simulations. The results show that our method can achieve our privacy preservation goal. Finally, we conclude this paper in Section 5.

2 Relate Work Privacy preservation is an important issue in data-sharing applications. Jawad et al. [1] propose a P2P data privacy model which combines the Hippocratic principles and the trust notions to support P2P systems. Di Crescenzo and Lipton [2] describe an evolving access control mechanism in social networking to provide non-trivial quantifiable guarantees for formally specified requirements of utility and privacy. Wei Q and Yansheng L [3] introduce a practical solution to defend against background knowledge attacks with considering the privacy disclosure in social network data publishing. Kobsa A. and Teltzrow M. [4] proposed a user interface design approach in which the privacy practice of a website was explicated in a contextualized manner. These methods for privacy preservation in data-sharing applications focus on certain computing application, none of them can offer privacy preservation in all scenarios for data sharing. Selective partial encryption offers protection for compressed image data [5], audio data and video data. Although this method can protect privacy by encrypting partial data which the user does not want to be shared, the receiver can recognize that the sender does not trust him from the partial encryption data. Therefore, this method is unsuitable for data sharing with privacy preservation.

3 User-Centric Privacy Preservation Method 3.1 Formally Description In this section, we describe the concepts and expressions used in our privacy preservation method.

User-Centric Privacy Preservation in Data-Sharing Applications

425

Data object. We consider the user’s data as objects. These objects make up to a logical database. Let D1 , D 2 ,.... D m ∈ D be the user’s data objects (they can be image, text, file and so on), {D1, D2 ,.... D m} is the user’s logical database, and D is the space of data object. Privacy range. The range of privacy information is defined by the customized setting and the default setting. The customized setting means the user can choose the key words and the characters of their privacy information. The set K {k / k ∈ (k1 , k 2 ,.... k n)} and the set C {c / c ∈ (c1 , c2 ,....co)} denote the choice of the user’s key words and characters for their privacy information respectively. The default setting means that makes all the general acknowledged privacy information into privacy range. Privacy policy ontology. We abstract the user’s privacy policy as privacy policy ontology and its attribute as trust evidence with constraint which need to be satisfied in semantic way ([6]). Privacy information ontology. According to the customized setting, the default setting and privacy policy ontology, privacy information ontology created in semantic way ([6]) correspondingly. Privacy information ontology describes the needed trust evidences when disclosing privacy. Privacy mapping function P. The mapping function P : D → {0,1} denotes the user data D whether contains privacy based on privacy information ontology. Let d1 , d 2 ,.... d r ∈ {0,1} express the value of function P where d j = P( D j )( j = 1,2,...r ) . And d j = 1 means the user’s data D j involves privacy while d j = 0 means D j does not involve privacy. Privacy entropy. We describe the entity’s trust that should be achieved when disclosing a piece of privacy as T a , the trust that the entity has already been achieved before the privacy disclosure as T b .Conditional probability P = prob(T a / T b) denotes the probability of achieving T a under condition T b . Set T e {e1, e 2 ,...e s} denotes the needed trust evidences when disclosing one piece of privacy information. Conditional probability P i = prob(T ai / T bi ) represents the i th trust evidence’s conditional probability when disclosing the privacy information. Let p1, p 2 ,.. p s denote the original values of conditional probability p i , preprocess Pi by formula 1. Pi =

pi p1 + p 2 + .... p s

(1)

We use H to denote privacy entropy which means privacy information quantity: s H = −k ( ∑ P i log P i ) i =1

s

( ∑ P i = 1 , k = p1 + p 2 + .... + p s ) i =1

(2)

Trust mapping function T. The mapping function T : D → (a, b) denotes that for certain trust, the interval of privacy entropy can be afforded in data sharing. This mapping function related to the user’s trust model and privacy policy.

426

F. Gao, J. He, and S. Peng

Substitution data object. Let D1 ' , D2 ' ,.... Dt '∈ D be the substitution data objects (they can be image, text, file and so on), {D1 ' , D 2 ' ,.... Dt '} as logical substitution database, and D' is the substitute data object space. 3.2 Privacy Preservation Method The architecture of our user-centric privacy preservation method in data sharing is illustrated in Fig.1. And this method includes privacy setting module and privacy processing module. Privacy setting module, in this module the users can choose their own key words ( K {k / k ∈ (k1 , k 2 ,.... k n)} ) and characters ( C {c / c ∈ (c1 , c2 ,....co)} ) for their privacy with the default setting which makes the all general acknowledged privacy information into privacy range. Privacy processing module, this module process the user’s sharing data by four steps. Step1. Scan the information that the user want to share by using privacy mapping function P, for the sharing data set Ds ( D1 , D 2 ,.... Dt )(t ≤ m) , if d l = P( Dl ) = 1(l = 1,2,...m) , record d l to set d ' d '∈ {d k ' / d k ' = d j , d j = P( D j ) = 1}, (k ∈ (1, w), j ∈ (1, k )) . Step2. For each D j that P( D j ) = d j , (d j ∈ d ' ) , compute H ( D j ) , then let H ' records the sum of the H ( D j ) based on formula (2). n w H ' = ∑ H ( D j ) ( H ( D j ) = −k ( ∑ P i log P i ) j =1 i =1

)

(3)

Step3. Divide the trust interval to define that for the certain trust how much privacy information can be shared. The total of interval can be set by the user or default as a certain constant. Denote there is u trust interval T , each interval controls v terms privacy information Q , for t i ∈ T , q j ∈ Q , t i controls q j privacy information. For trust interval t i , the lower bound of mapping function T : D → (a, b) says ai compute by formula (4). ai =

u v 1 (i − 1) ∑ ∑ H (q j )(i = 1,2,....u ) u i =1 j =1

(4)

And the upper bound says bi compute by formula (5). bi =

u v 1 ⋅ i ⋅ ∑ ∑ H (q j )(i = 1,2,....u ) u i =1 j =1

(5)

So the interval (ai , bi) denotes the privacy entropy that the i th trust interval can be afforded in data sharing. Step4. For the sharing data set, if H '∈ (ai , bi) , share the data with the entity. If H '∉ (ai , bi) , substitute some sharing data from substitution database, until the H ' satisfied H '∈ (ai , bi) , then share that new data set.

User-Centric Privacy Preservation in Data-Sharing Applications

427

Our user-centric privacy preservation method in data-sharing applications includes 9 steps as follows. 1. The user input the data that he wants to share to privacy preservation model; 2. Creates privacy policy ontology according to privacy policy and based on it also with the user setting and the default setting, privacy setting module creates the privacy ontology; 3. Scan input data set by using privacy mapping function P then record the information which involved privacy; 4. Compute the entropy of privacy information; 5. Divide the trust interval; 6. According to entity’s trust, judge the privacy entropy whether in the interval or not. If privacy entropy is in it, share the data set, otherwise do step 7; 7.Substitute some data from substitution data base; 8. Do 4 until the privacy entropy is in relevant interval; 9. Output the new data set which means privacy preserved.

Privacy preserving module Privacy setting module Privacy policy

Share data set

Default set

Privacy mapping

New share data set with privacy preservation

Privacy policy ontology

User set Privacy information ontology

Privacy processing module

Privacy entropy

Substitution database

Trust interval

Trust model

Fig. 1. Architecture of Privacy Preservation Model

4 Simulations and Analysis We have performed evaluation of our method using the .NET Framework 2.0 platform and implemented a simulator written in C#. The experiment was carried out on a laptop computer with an Intel Pentium (R) Core(TM) 2 Duo 1.83 GHz CPU and 1G bytes of memory. In our simulation, we choose the parameters as follows. The data set {D1 , D2 ,.... Dm}(m = 500) is the user’s logical database, there is u (u = 5) trust interval T , each interval controls v(v = 10) terms privacy information Q . The space of privacy information is 50 terms and the data set {D1 ' , D 2 ' ,.... Dt '}(t = 15) is logical substitution database. Privacy entropy uses the data form our previous work [7], the sum of privacy information entropy is between 0 and 3.5 in our randomly selecting data simulations.

428

F. Gao, J. He, and S. Peng

In case 1, we select 10 terms data from 500 randomly by using our privacy entropy computing method but without substitution method to describe the situation of data sharing. We run our simulation 50 rounds, the x-axis records the time of round, and y-axis denotes the trust interval. There are 5 trust intervals which can afford privacy entropy [0,0.7) [0.7,1.4) [1.4,2.1) [2.1,2.8) [2.8,3.5] respectively. In our randomly selection, if privacy entropy of sharing data is in arbitrary interval, record as one successful data sharing. The result of case 1 is illustrated in Fig.2. The coordinate ( x, y) means the yth simulation involved x privacy. Take point (40,0.7) as example, it means the 40th simulation data sharing scenario is successful if entity’s trust can afford privacy entropy 0.7. That is to say, if the entity’s trust is in the certain trust interval which can afford privacy entropy between 0.7 and 1.4, this entity can share data with user. And the entity whose trust is in the certain trust interval which can afford privacy entropy [1.4,2.1) , [2.1,2.8) and [2.8,3.5] respectively also can share data with user. So our simulations show the situation of once data selecting, each trust interval that entity’s trust in whether can share this selected data or not. Then we do case 2 that select 20 terms data from 500 randomly and show result in Fig.3. We can see that the points which mean successful data sharing in Fig.3 is universal lower than the points in Fig.2. That is to say in case 1 the demand of trust to share selected data is more strictly. In case 3 and case 4, we select 10 data terms and 20 data terms respectively from the user data space randomly with using our substitution method to describe the situation of data sharing. Fig.4 show the result of case 3 and Fig.5 illustrates the result of case 4 respectively. We denote “successful rate of data sharing” as successful data sharing to all of data sharing scenarios. We then record the successful rate of data sharing and illustrate in table 1. We can see that without the substitution method, successful rate of data sharing is 37.2% and 34.8% respectively according to our simulation scenario. That is clear that in process of data sharing with privacy considering the success rate is low. With substitution method, the successful rate of user data sharing is close to 54.4% and 54.8% respectively. That accounts for our method is effective in data sharing with privacy preservation. Also we can learn that the more data selection for share the smaller successful rate by comparing case 1 to case 2. Fig.6 illustrates information quantity of data sharing compare case 1 to case 3. By using substitute method, the information quantity is less than the data sharing without substitute method which means privacy enhanced.

Fig. 2. The situation of data sharing selecting 10 from 500 (without substitution)

User-Centric Privacy Preservation in Data-Sharing Applications

Fig. 3. The situation of data sharing selecting 20 from 500 (without substitution)

Fig. 4. The situation of data sharing selecting 10 from 500 (with substitution)

Fig. 5. The situation of data sharing selecting 20 from 500 (with substitution)

Fig. 6. Information quantities compare case 1 to case 3

429

430

F. Gao, J. He, and S. Peng

Table 1. Successful Rate of Data Sharing Simulation Case Case 1 Case 2 Case 3 Case 4

Successful Rate of Data Sharing 37.2% 34.8% 54.4% 54.8%

5 Conclusions and Future Work Privacy preservation is an important issue for data sharing in applications such as P2Pnetworking, online social networking and so on. Privacy is a concept combining all kinds of elements, and the dimension of privacy is diversiform. To solve the privacy preservation problem in data sharing, we propose a novel user-centric method in this paper. In our method, the users can select key words and characters for their privacy information, which means user-centric way to protect privacy. Then during the process of data sharing, if the sharing data contains too much privacy according to user’s privacy policy and trust model, our data substitution method can be used to ensure privacy preservation and as well as high successful rate for data sharing. And our method can be used in different data sharing applications in a flexible way. We did five simulation cases and simulation results show that our method can achieve our privacy preservation goal. In the future, we will analyze the complexity and cost of our approach; refine our method to balance privacy preservation and its additional cost. We need further improve our method with real implementation and applications.

Acknowledgements The work in this paper has been supported by research funding from Beijing Education Commission (Grant No. KM201010005027).

References 1. Jawad, M., Serrano-Alvarado, P., Valduriez, P.: Protecting Data Privacy in Structured P2P Networks. In: Data Management in Grid and Peer-to-Peer Systems. Linz, Austria (2009) 2. Di Crescenzo, G., Lipton, R.J.: Social Network Privacy via Evolving Access Control. In: Liu, B., et al. (eds.) WASA 2009. LNCS, vol. 5682, pp. 551–560. Springer, Heidelberg (2009) 3. Wei, Q., Yansheng, L.: Preservation of Privacy in Publishing Social Network Data. In: International Symposium on Electronic Commerce and Security (2008) 4. Kobsa, A., Teltzrow, M.: Contextualized Communication of Privacy Practices and Personalization Benefits: Impacts on Users’ Data Sharing and Purchase Behavior. In: Martin, D., Serjantov, A. (eds.) PET 2004. LNCS, vol. 3424, pp. 329–343. Springer, Heidelberg (2005)

User-Centric Privacy Preservation in Data-Sharing Applications

431

5. Spinsante, S., Gambi, E.: Selective Encryption for Efficient and Secure Transmission of Compressed Space Images. In: International Workshop on Satellite and Space Communications (IWSSC), Tuscany, Italy, pp. 9–11 (September 2009) 6. Gao, F., He, J., Peng, S., Wu, X., Liu, L.: An Approach for Privacy Protection Based-on Ontology. In: The 2nd International Conference on Networks Security, Wireless Communications and Trusted Computing, Wuhan, China, April 24-25 (2010) 7. Gao, F., He, J., Peng, S., Wu, X.: A Quantify Metric for Privacy Protection Based on Information Theory. In: 3rd International Symposium on Intelligent Information Technology and Security Informatics (IITSI 2010), Jinggangshan, China (April 2-4, 2010)

Software Metrics Reduction for Fault-Proneness Prediction of Software Modules Yunfeng Luo, Kerong Ben, and Lei Mi Department of Computer Engineering Navy University of Engineering Wuhan, China [email protected], [email protected], [email protected]

Abstract. It would be valuable to use metrics to identify the fault-proneness of software modules. However, few research works are on how to select appropriate metrics for fault-proneness prediction currently. We conduct a large-scale comparative experiment of nine different software metrics reduction methods over eleven public-domain data sets from the NASA metrics data repository. The Naive Bayes data miner, with a log-filtering preprocessor on the numeric data, is utilized to construct the prediction model. Comparisons are based on the analysis of variance. Our conclusion is that, reduction methods of software metrics are important to build adaptable and robust software fault-proneness prediction models. Given our results on Naive Bayes and log-filtering, discrete wavelet transformation outperforms other reduction methods, and correlationbased feature selection with genetic search algorithm and information gain can also obtain better predicted performance. Keywords: Software fault-proneness; metrics reduction; analysis of variance.

1 Introduction Software fault static prediction technologies are defined as the methods which predict the amount or distribution of the software faults based on the software metrics [1]. In general, it aims at answering one or several of the following questions [2]: 1) Which metrics that are easy to be collected are good fault predictors? 2) Which models, quantitative, qualitative, hybrid, etc., should be used for fault prediction? 3) How accurate are those models? Software metric is a measure of some property of a piece of software or its specifications. It is very common nowadays for an organization to maintain several software metrics repositories for each undertaken project [3]. The metrics which are appropriate for predicting software fault-proneness have product and process categories [4]. The datasets comes from the NASA Metrics Data (MDP) repository involves more than 40 method-level metrics, and 8 class-level metrics [5]. Menzies proposed that mining static code attributes to learn fault predictors was useful, but he suggested not seeking “best” subsets of static code attributes, and building fault predictors using all available attributes metrics, followed by finding the most appropriate particular subset for a particular domain[6]. It is evident that the set of metrics should be reduced to the minimum set of representative parameters in order to C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 432–441, 2010. © IFIP International Federation for Information Processing 2010

Software Metrics Reduction for Fault-Proneness Prediction of Software Modules

433

avoid metrics which are not useful in the analysis or which introduce noise in the evaluation of data [7]. We organize a large-scale comparative experiment of nine different software metrics reduction methods over eleven public-domain data sets from the NASA MDP. We utilize Naive Bayes data miner with a log-filtering preprocessor on the numeric data to construct the prediction model [6]. Comparisons are based on the analysis of variance (ANOVA), which is a statistical technique of examining whether independent groups or populations are significantly different from one another. In our study, the one-way ANOVA is selected to analyze the performances of nine reduction methods of metrics.

2 Related Work Since some independent variables might be highly correlated, i.e. multicollinearity, and serious multicollinearity problem will affect the stability of the results of regression analysis. Some methods are proposed to overcome the multicollinearity problem. Nagappan utilized principal component analysis (PCA) to reduce the software metrics [8]. Shatnawi analyzed the multicollinearity through three tests [9]: the Spearman correlation analysis, the Variance inflation factor (VIF) analysis, and the condition number analysis (use 30 as the cutoff value). The Spearman correlation in the metrics group was calculated first. After the high correlations were noted, they calculated the condition number and the VIFs for the group. If the condition number was greater than 30, the metrics that had the highest VIF value was dropped, and the condition number was recalculated; this process continued until the condition number was below 30; the remaining metrics were candidates to enter the MLR models. A correlation-based feature selection technique (CFS) was applied to down-select the best predictors out of the 21 independent variables in the datasets [10]. This involves searching through all possible combinations of variables in the dataset to find which subset of variables works best for prediction. Olague utilized simple correlation calculation to ascertain whether the currently available metrics can predict the initial quality of a software system [11]. Zhou utilized univariate regression analysis to examine the effect of each metric separately, identifying the metrics which are significantly related to fault-proneness of classes and identifying potential fault-proneness predictors to be used in multivariate analysis 2 [12]. Vandecruys went through an input selection procedure using a χ -based filter [13]. First, the observed frequencies of all possible combinations of values for class and variable are measured. Based on this, the theoretical frequencies, assuming complete independence between the variable and the class, are calculated. The hypothesis 2 of equal odds provides a χ test statistic; higher values allow one to reject the null hypothesis of equal odds more confidently; hence, these values allow one to rank the variables according to predictive power. Menzies ranked the attributes using the information gain [6]. If A is a metric and C is the class, n(c) is the examples amount of every class, N = ∑ n(c) , p(c)=n(c)/N, c∈C and p(c/a) is the probability of the metric a belongs to class c. Eq.1 gives the entropy of the class before and after observing the metric.

434

Y. Luo, K. Ben, and L. Mi

H (C ) = −∑ p(c) log 2 p(c) c∈C

H (C | A) = −∑ p( a)∑ p (c | a ) log 2 p(c | a ) a∈A

(1)

c∈C

Each metric Ai is assigned a score based on the information gain between itself and the class: InfoGain(A i ) = H (C ) − H (C | Ai )

(2)

3 Proposed Methods In this section, we introduce the data reduction techniques firstly, which include nine data reduction methods. The Naive Bayes data miner, with a log-filtering preprocessor on the numeric data, is utilized to construct the prediction model. Finally, we describe our model evaluation criteria and the analysis of variance. 3.1 Data Reduction Techniques Feature extraction and feature selection are two different approaches for the reduction of dimensionality. Feature extraction involves linear or nonlinear transformation from the original feature space to a new one of lower dimensionality. Although it does reduce the dimensionality of the vectors fed to the classifier, the number of features that must be measured remains the same. Feature selection, on the other hand, directly reduces the number of original features by selecting a subset of them that still retains sufficient information for classification. In general, feature selection approaches can be grouped into two categories: filter methods and wrapper methods [14]. Acquiring no feedback from classifiers, the filter methods estimate the classification performance by some indirect assessments, such as distance measures which reflect how well the classes separate from each other. The wrapper methods, on the contrary, are classifier-dependent. Based on the classification accuracy, the methods evaluate the “goodness” of the selected feature subset directly, which should intuitively yield better performance. In spite of the good performance, the wrapper methods have limited applications due to the high computational complexity involved. Filters and wrappers hava the similar search approach. One popular filter metrics for classification problems are correlation. We choose CFS as our filter method. For wrapper method, we choose J48 as the classifier. Two search approaches, best first and genetic algorithm, are choosed for subset selection. For feature extraction, we focus on two popular and effective methods of lossy dimensionality reduction [15]: principal component analysis (PCA) and discrete wavelet transformation (DWT). PCA searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n . The original data are thus projected onto a much smaller space, resulting in dimensionality reduction. Unlike feature subset selection, which reduces the metric set size by retaining a subset of the initial set of metrics, PCA combines the essence of metrics by creating an alternative, smaller set of variables. Then the initial data can be projected onto this smaller set.

Software Metrics Reduction for Fault-Proneness Prediction of Software Modules

435

PCA often reveals relationships that are not previously suspected and thereby allows interpretations that would not ordinarily result. DWT is a linear signal processing technique that, that is applied to transform a data vector X to a numerically different vector X ' of wavelet coefficients. The two vectors are of the same length. The usefulness lies in the fact that the wavelet transformed data can be truncated. A compressed approximation of the data can be retained by storing only a small fraction of the strongest of the wavelet coefficients. A particular reduction method evaluated features based on information gain, which evaluates the worth of a metric by measuring the information gain with respect to the class. Menzies ranked the attributes using the information gain [6]. In our study, we got a metrics queue based the value of information gain, and choose top 3, 4 and 5 metrics respectively to enter the prediction model. The No-Reduction method, which does not reduce metrics and allows all metrics to enter the prediction model, is used as a baseline method to compare against the investigated metrics reduction methods. With these comparisons, we can determine whether some (or even all) of the metrics reduction methods are better than NoReduction method in terms of their predictive accuracy on faults or vice versa. These reduction methods are summarized in Table 1. These methods are implemented in the WEKA data mining tool [16]. Table 1. Software metrics reduction methods No.

Reduction methods

Description

1

CfsSubsetEval+ BestFirst

2

CfsSubsetEval+ GeneticSearch

3

InfoGainAttributeEval (top 3)

4

InfoGainAttributeEval (top 4)

Evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them. The search approach is greedy hill-climbing augmented with a backtracking facility. Performs a search using the simple genetic algorithm described in Goldberg Evaluates the worth of an attribute by measuring the information gain with respect to the class. Choose top 3 features in the features rank. Choose top 4 features in the features rank.

5

InfoGainAttributeEval (top 5)

Choose top 5 features in the features rank.

6

WrapperSubsetEval+J48+Best First

7 8 9

WrapperSubsetEval+J48+ GeneticSearch DWT PCA

Evaluates attribute sets by using a learning scheme. Cross validation is used to estimate the accuracy of the learning scheme (we choose J48 ) for a set of attributes. The search method is BestFirst. The search method is GeneticSearch

10

No-Reduction

A filter for wavelet transformation. Performs a principal components analysis and transformation of the data. Dimensionality reduction is accomplished by choosing enough eigenvectors to account for some percentage of the variance in the original data -- default 0.95 (95%). Allow all metrics to enter the prediction model, which is used as a baseline method.

436

Y. Luo, K. Ben, and L. Mi

3.2 Classification Technique and Evaluation Strategy We use Naïve Bayes classifiers as our prediction model to evaluate the metrics reduction methods [6]. Naïve Bayes classifier is based on Bayes’ Theorem. Formally, P( H | E ) =

P( H ) ∏ P ( Ei | H ) P(E ) i

(3)

that is, given fragments of evidence Ei and a prior probability for a class P(H), the posteriori probability P(H|E) is calculated. The predicted variable is whether one or more faults exist in the given module. We use the following set of evaluation measures. The Probability of Detection (pd) is defined as the probability of the correct classification of a module that contains a fault: (Note that pd is also called recall). The Probability of False alarm (pf) is defined as the ratio of false positives to all non-defect modules.For convenience, we say that notpf is the complement of pf: notpf=1-pf. In practice, engineers balance between pf and pd. To operationalize this notion of balance, bal is defined to be the Euclidean distance from the sweet spot pf=0, pd=1 to a pair of (pf, pd). balance = bal = 1 −

(0 − pf ) 2 + (1 − pd ) 2

(4)

2

Hence, better and higher balances fall closer to the desired sweet spot of pf=0, pd=1. 3.3 Analysis of Variance In our study, the one-way ANOVA design is selected to analyze the performances of the nine metrics reduction methods. In this design, a reduction method corresponds to a group. Let Y1 j , " , Yn j represent a random sample of n j observations taken from the popuj

lation of group j. In our experiment, ten different data mining algorithms are used. In addition, eleven datasets are available and the dependent variable is known for all the datasets. Yij, the ith observation in group j (where i = 1,2,…,11 and j = 1,2,…,10), can be represented by the following model

Yij = μ + δ j + ε ij

(5)

where μ is the overall effect to all the observations; δ j = μ j − μ is the treatment effect related to the jth group; ε ij = Yij − μ is the experimental error associated with j the ith observation in group j; μ is the true mean of the jth group. The F statistic j validates whether the 10 population means are equal. In one way ANOVA, between group variance is denoted by Sb2 , and with group variance is denoted by S w2 , the following equation shows how the F-test is calculated in our experiment. F=

Sb2 / 10 10

S w2 // ∑ ( n j − 1) j =1

(6)

Software Metrics Reduction for Fault-Proneness Prediction of Software Modules

437

4 Empirical Evaluation 4.1 Data Eleven data sets are available in MDP, as shown in Table 2. Each module of each datasets describes the metrics of corresponding module, plus the number of defects known for that module. The error count column was converted into a Boolean metrics called fault- proneness? as follows: Fault - proneness ? = (error count ≥ 1)

Finally, the error density column is removed (since it can be derived from line counts and error count). The preprocessed data sets have two categories: 38 attributes and 22 attributes, plus one target attribute (fault-proneness?), and include Halstead, McCabe, lines of code, and other miscellaneous metrics [6]. The difference of two categories is that one includes the miscellaneous metrics and the other doesn’t. Menzies found that these values in the data sets formed an exponential distribution with many small values and a few much larger values. So they suggested that a logarithmic filter on all numeric values might improve predictor performance [17]. Such a filter replaces all numerics n with their logarithms, Ln(n). In our experiment, we applied a logarithmic filter on all numeric values, and added all numbers in the datasets with 0.00001 to avoid numerical errors with ln(0). Table 2. Datasets used in this study Data set CM1 JM1 KC1 KC3 KC4 MC1 MW1 PC1 PC2 PC3 PC4

language C C C++ java Perl C&C++ C C C C C

Total LOC 20K 315K 43 K 18K 25 K 63K 8K 40K 26K 40K 36K

#modules 505 10878 2107 458 125 9466 403 1107 5589 1563 1458

%fault-proneness 16.04 19.32 13.91 6.33 48 0.64 6.70 6.59 0.39 10.24 12.21

4.2 Experimental Result We applied nine different software metrics reduction methods shown in Table 1 in addition to No-Reduction, over eleven public-domain data sets in Table 2. The metrics subsets, chosen by reduction method 6 over datasets KC3, PC2 and PC3, and chosen by reduction method 7 over dataset PC2 respectively, are null sets. Reduction method 9 classified all modules in MC1 as no fault-proneness. Therefore, the values of pd, notpf and bal calculated by these reduction methods over corresponding datasets are null. The detailed predicted results are shown in Table 3 - Table 5.

438

Y. Luo, K. Ben, and L. Mi Table 3. The predicted results for pd

Reduction Methods

CM1

JM1

KC1

KC3

KC4

MC1

MW1

PC1

PC2

PC3

PC4

1 2 3 4 5 6 7 8 9 10

0.679 0.728 0.210 0.370 0.469 0.222 0.457 0.617 0.222 0.679

0.419 0.423 0.199 0.222 0.221 0.346 0.330 0.655 0.254 0.475

0.768 0.768 0.741 0.853 0.860 0.379 0.696 0.846 0.556 0.846

0.552 0.552 0.448 0.483 0.586

0.833 0.783 0.817 0.867 0.833 0.867 0.700 0.783 0.583 0.700

0.295 0.328

0.630 0.667 0.593 0.593 0.704 0.222 0.704 0.704 0.037 0.704

0.685 0.822 0.342 0.767 0.849 0.616 0.753 0.712 0.342 0.822

0.318 0.682 0.773 0.773 0.773

0.819 0.869 0.775 0.863 0.856

0.727 0.318 0.727

0.844 0.813 0.556 0.881

0.860 0.899 0.978 0.972 0.916 0.910 0.848 0.876 0.826 0.910

0.862 0.828 0.517 0.862

0.033 0.426 0.262 0.459 0.902 0.754

Table 4. The predicted results for notpf Reduction Methods

CM1

JM1

KC1

KC3

KC4

MC1

MW1

PC1

PC2

PC3

PC4

1 2 3 4 5 6 7 8 9 10

0.613 0.590 0.950 0.894 0.840 0.913 0.757 0.604 0.906 0.550

0.831 0.829 0.929 0.884 0.872 0.865 0.842 0.617 0.919 0.764

0.641 0.641 0.698 0.596 0.573 0.871 0.716 0.598 0.809 0.592

0.928 0.907 0.862 0.823 0.797

0.692 0.723 0.677 0.631 0.692 0.754 0.800 0.738 0.862 0.800

0.950 0.939

0.803 0.803 0.875 0.840 0.771 0.963 0.750 0.761 0.981 0.734

0.788 0.638 0.900 0.680 0.589 0.675 0.676 0.696 0.907 0.582

0.969 0.884 0.866 0.880 0.878

0.623 0.556 0.678 0.588 0.628

0.809 0.975 0.786

0.602 0.670 0.795 0.547

0.784 0.765 0.709 0.709 0.747 0.772 0.773 0.710 0.813 0.675

0.655 0.690 0.893 0.643

0.979 0.843 0.942 0.781 0.897 0.675

Table 5. The predicted results for bal Reduction Methods

CM1

JM1

KC1

KC3

KC4

MC1

MW1

PC1

PC2

PC3

PC4

1 2 3 4 5 6 7 8 9 10

0.645 0.652 0.440 0.549 0.608 0.447 0.579 0.610 0.446 0.609

0.572 0.574 0.432 0.444 0.442 0.528 0.513 0.636 0.469 0.593

0.698 0.698 0.719 0.696 0.682 0.552 0.706 0.696 0.658 0.692

0.679 0.676 0.598 0.613 0.674

0.753 0.751 0.737 0.722 0.753 0.802 0.745 0.760 0.690 0.745

0.500 0.523

0.703 0.726 0.699 0.691 0.735 0.449 0.726 0.731 0.319 0.718

0.732 0.715 0.530 0.720 0.690 0.645 0.712 0.704 0.530 0.679

0.517 0.761 0.813 0.818 0.818

0.704 0.673 0.722 0.693 0.718

0.765 0.518 0.755

0.698 0.732 0.654 0.669

0.818 0.819 0.793 0.793 0.811 0.827 0.807 0.777 0.819 0.762

0.737 0.749 0.650 0.730

0.316 0.579 0.477 0.587 0.899 0.293 0.712

Based on data from Table 3 - Table 5, we use the one-way ANOVA design to analyze the performances of the nine metrics reduction methods Fig. 1 shows the values of the mean and standard error of mean for pd, notpf and bal. We observe that DWT (method 8) performs the best on mean for pd, PCA (method 9) performs the best for notpf, and DWT performs the best for bal. It is also worth noting that the means of pd, notpf and bal vary according to the reduction methods, so it is for the standard error.

Software Metrics Reduction for Fault-Proneness Prediction of Software Modules

(a)

(b)

439

(c)

Fig. 1. The error chart for (a) pd, (b) Notpf , (c) bal

Our null hypothesis is that there is no significance difference for performances of the ten metrics reduction methods. The F-value is selected at 90% confidence level (i.e., α = 0.10) [3]. The p-value related to the F-test is also provided. Table 6 - Table 8 present the results of ANOVA. Since the p-value is well below the significance level α, we can reject the null hypothesis and conclude that there is a significant difference in the means of the 10 metrics reduction methods. Table 6. ANOVA for pd

Between Groups Within Groups Total

Sum of Squares 1.111 4.437 5.548

df 9 94 103

Mean Square 0.123 0.047

F 2.615

p-value 0.010

Table 7. ANOVA for notpf

Between Groups Within Groups Total

Sum of Squares 0.386 1.088 1.473

df 9 94 103

Mean Square 0.043 0.012

F 3.701

pvalue 0.001

Table 8. ANOVA for bal

Between Groups Within Groups Total

Sum of Squares 0.206 1.208 1.415

df 9 94 103

Mean Square 0.023 0.013

F 1.783

p-value 0.082

Table 9 presents the description of pd, notpf and bal in detail. The results indicate that the all metrics reduction methods improve notpf, and most methods improve the bal compared with NoRedcution. But at the same time, the pd decreases with the increase of notpf. The only exception is DWT. The means of pd and notpf for DWT both increase compared with NoRedcution. It is also worth noting that the pf for CfsSubsetEval+ GeneticSearch (method 2) and InfoGainAttributeEval+top5 (method 5) have a minimum decrease among the other 8 reduction methods compared with

440

Y. Luo, K. Ben, and L. Mi

NoRedcution. We conclude that the metrics reduction methods could improve the predicted performance on average, the DWT outperforms the other reduction methods, and CFS with genetic search algorithm and information gain also could obtain better predicted performance. Table 9. The data description for pd, notpf and bal Reduction Methods

N

1 2 3 4 5 6 7 8 9 10 Total

11 11 10 11 11 8 10 11 10 11 104

Mean 0.623 0.684 0.588 0.618 0.681 0.478 0.665 0.769 0.421 0.760 0.635

pd Std. Deviation 0.203 0.182 0.273 0.305 0.226 0.283 0.187 0.093 0.228 0.124 0.232

Mean 0.784 0.752 0.815 0.781 0.736 0.844 0.735 0.708 0.886 0.668 0.768

not_pf Std. Deviation 0.130 0.133 0.110 0.133 0.110 0.101 0.073 0.091 0.066 0.094 0.120

Mean 0.666 0.688 0.648 0.641 0.683 0.591 0.681 0.733 0.575 0.697 0.663

Bal Std. Deviation 0.099 0.084 0.140 0.151 0.109 0.152 0.091 0.076 0.145 0.056 0.117

5 Conclusion and Future Work In this paper, we have conducted a large-scale comparative experiment of nine different software metrics reduction methods over eleven public-domain data sets from the NASA Metrics Data (MDP) repository. It is shown that, on average, there is a significant difference in the predicted performance of the different metrics reduction methods. We also demonstrate that the metrics reduction methods can improve the predicted performance, the DWT outperforms the other reduction methods on average, and CFS with genetic search algorithm and information gain can also obtain better predicted performance. These two conclusions are critical for complex data mining problems such as software fault-proneness classification. The practitioner should not solely rely on sophisticated and/or robust algorithms to generate accurate predictions. As demonstrated in our study, it is advised to use some feature reduction methods to reduce software metrics in order to improve the predicted performance. A source of bias in this study is the set of data reduction methods explored by this study. Data mining is a large and active field and any single study can only use a small subset of the known data mining algorithms. Therefore, we only suggest that DWT does best in the chosen nine typical reduction methods. Future works can investigate whether the classifier would affect the performance of reduction methods. In addition, it would be worthwhile to investigate whether the metrics reduction method is effective for the other metrics, such as object-oriented metrics and software process metrics. Acknowledgments. This research is supported by the national prior research foundation of China under Grant No.513270104.

Software Metrics Reduction for Fault-Proneness Prediction of Software Modules

441

References 1. Wang, Q., Wu, S.J., Li, M.S.: Software Defect Prediction. Journal of Software 19, 1565– 1580 (2008) (in Chinese) 2. Raimund, M., Witold, P., Giancarlo, S.: A Comparative Analysis of the Efficiency of Change Metrics and Static Code Attributes for Defect Prediction. In: 30th International Conference on Software Engineering, pp. 181–190. ACM, New York (2008) 3. Khoshgoftaar, T.M., Rebours, P., Seliya, N.: Software Quality Analysis by Combining Multiple Projects and Learners. Software Quality Journal 17, 25–49 (2009) 4. Catal, C., Diri, B.: A Systematic Review of Software Fault Prediction Studies. Expert Systems with Applications 36, 346–354 (2009) 5. NASA Metrics Data (MDP) Repository, http://mdp.ivv.nasa.gov 6. Menzies, T., Greenwald, J., Frank, A.: Data Mining Static Code Attributes to Learn Defect Predictors. IEEE Transactions on Software Engineering 33, 2–13 (2007) 7. Bellini, P., Bruno, I., Nesi, P., Rogai, D.: Comparing Fault-proneness Estimation Models. In: 10th IEEE International Conference on Engineering of Complex Computer Systems (ICECCS 2005), pp. 205–214. IEEE Press, New York (2005) 8. Nagappan, N., Ball, T., Zeller, A.: Mining Metrics to Predict Component Failures. In: 28th International Conference on Software Engineering, pp. 119–125. ACM, New York (2006) 9. Shatnawi, R., Li, W.: The Effectiveness of Software Metrics in Identifying Error-prone Classes in Post-release Software Evolution Process. Journal of Systems and Software 81, 1868–1882 (2008) 10. Elish, K.O., Elish, M.O.: Predicting Defect-prone Software Modules Using Support Vector Machines. Journal of Systems and Software 81, 649–660 (2008) 11. Olague, H.M., Etzkorn, L.H., Messimer, S.L.: An Empirical Validation of Object-oriented Class Complexity Metrics and Their Ability to Predict Error-prone Classes in Highly Iterative, or Agile Software: a Case Study. Journal of Software Maintenance and Evolution: Research and Practice 20, 171–197 (2008) 12. Yuming, Z., Hareton, L.: Empirical Analysis of Object-oriented Design Metrics for Predicting High and Low Severity Faults. IEEE Transactions on Software Engineering 32, 771–789 (2006) 13. Vandecruys, O., Martens, D., Baesens, B.: Mining Software Repositories for Comprehensible Software Fault Prediction models. Journal of Systems and Software 81, 823–839 (2008) 14. Kohavi, R., John, G.H.: Wrappers for Feature Subset Selection. Artificial Intelligence 97, 273–324 (1997) 15. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2006) 16. Witten, I.H., Frank, E.: Data Mining, Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco (2000) 17. Menzies, T., DiStefeno, J.S., Chapman, M.: Metrics that Matter. In: 27th Annual NASA Goddard Software Engineering Workshop (SEW 2002), pp. 51–57. IEEE Computer Society, Washington (2002)

A Methodology for Design of Unbuffered Router Microarchitecture for S-Mesh NoC Hao Liu1,3, Feifei Cao2, Dongsheng Liu3,*, Xuecheng Zou3, and Zhigang Zhang1 1 2

Henan Electric Power Research Institute, Zhengzhou 450052, China Henan Electric Power Industrial School, Zhengzhou 450051, China 3 Department of Electronic Science & Technology, Huazhong University of Science & Technology, Wuhan 430074, China [email protected]

Abstract. Currently, most of Network on-Chip (NoC) architectures have some limitation in routing decisions. And it makes router nodes overloaded, and sequentially forms deadlock, livelock and congestion. A simple unbuffered router microarchitecture for S-mesh NoC architecture is proposed in this paper. Unbuffered router transforms message without making routing decision. Simulation results showed that S-mesh could get optimal performance in message latency compared with 2D-mesh, Butterfly and Octagon NoC architectures. The Design Compiler synthesis results showed that unbuffered router has obvious advantages on area, and it gets higher operation speed. Keywords: NoC, separated-mesh (S-mesh), unbuffered, low-latency, low-cost.

1 Introduction 1.1 Background With the arrival of multicore era, traditional bus-based interconnect architectures have became bottleneck for the multicores communication. The Network on-Chip [1], [2] design paradigm is seen as the ultimate solution of enabling the integration of exceedingly high number of cores for the future on-chip communication network architectures. In NoC based system, messages are exchanged between cores using a network and packet switching paradigm. The messages are relayed from one core to another along the path constructed by routers and links. The design of NoC communication network architectures would be facing a major design challenge that is to construct an area-efficiency, low-latency, scalable on-chip communication network. NoC is an emerging paradigm for communications with large VLSI systems implemented on single silicon chip. It brings forward a networking method to on-chip communication and brings about notable improvements over conventional bus systems, such as ARM AMBA, Wishbone, STBus, Core Connect, and so on. There are several architectures for NoCs such as Fat-tree [3], Mesh [4], Octagon [5] and Spidergon [6]. Fat-tree and 2D-mesh as two of the most popular topologies. Some NoC * Corresponding author. C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 442–451, 2010. © IFIP International Federation for Information Processing 2010

A Methodology for Design of Unbuffered Router Microarchitecture for S-Mesh NoC

443

implementation and research, such as Nostrum [7], Æthereal [8], Raw network [9], Xpipes [10], Eclipse [4] have been implemented on top of Fat-tree, 2D-mesh in some extent. From a perspective of these authors, of course, these architectures are topology-independent. 1.2 Overview of NoC Architecture A typical NoC chip is a matrix of resource slots containing integrated embedded processors or systems connected to each other via a multi-dimensional mesh/tree network. Therefore, a typical NoC system includes resource nodes, router nodes, links and network interface unit, and routing algorithms for meet the requirements of the different architectures. The router nodes include the routing controller and an arbiter for resolving local route conflicts. The routing algorithm currently under consideration can be labeled pseudo-dynamic since it’s allowed only for restricted dynamic routing in case of router conflicts. In addition, router architecture should be adjusted with different routing algorithms. These influencing factors potentially increase uncertainty condition for system performances, especially network latency, congestion, cost and other limitations. Because the routers do not exactly know subsequent routers working conditions in real time. So local performance optimization always makes the whole system performance worse. Using these architectures for extremely large systems is very difficult [3]. We believe that NoC router architecture should be simple, low-latency, low-cost, and the number of data buffers should be minimal in the future. In accordance with the view of NoC as a research field of SoC. We focus on constructing feasible, low-latency and low-cost communication-centric design. 1.3 Outline of This Paper The methodology of multicores system based on NoC will be changed from computation-centered to communication-centered. Moreover, what its key goal of on-chip network is to construct a high-performance on-chip communication network with low-latency, and scalability for multicore chip system. In this paper, an unbuffered router architecture for Separation of Control and Data-transmission NoC architecture is proposed. This NoC architecture decouples the routing decision from router. The routers employ a pre-connection mechanism for the input channels and output channels that help to reduce the complexity of the crossbar matrix design. It also meets the aforementioned features of simple, fast and less buffer. The studies of this paper mainly embody in several aspects, such as network architecture, unbuffered router microarchitecture. This paper is organized as follows: In section 2, S-mesh NoC architecture is discussed, and router microarchitecture is illuminated in Section 3, the simulation results are presented in Section 4 and conclusions are provided in Section 5.

2 S-Mesh NoC Architecture S-mesh NoC [11] network borrows features and design methods from those used in parallel computing clusters and communication networks. This includes design

444

H. Liu et al.

conception that separate service and bearer control in communication network. It also includes the implementation way of centralize-distribution in communication system and IP carrying network, as well as the Message Transfer Part [12], which is a part of the Signaling System #7 [13] used for communication network. Nevertheless, these methods cannot be adopted directly. The S-mesh is based on 2D-mesh topology. In the S-mesh system, the kernel communication network adopts circuit-switching mode. And the edge devices, such as resource nodes, adopt the packet switching mode. The S-mesh network architecture consists of three types of sub-networks: mesh-based data transmission network (DN), butterfly-based control network (CN) and local bypass network (BN). The S-mesh NoC consists (Fig.1) of resources, routers and connection process unit. DN connects the resources to its nearest routers. Each router should be connected to the CN as well as four other neighboring routers through BN and DN. A BN or DN link consists of two one-directional point-to-point buses. The functionalities of router nodes only undertake link layer functionalities and physical layer functionalities. Routers do not need to store any packets before forwarding though dedicated path has been established advanced. CN is responsible for the system resource management, routing decisions, and flow control. All resources are connected to interconnection fabric with a Network Interface Unit (NIU). The NIUs handle all communication protocols, which are used to make the network as a transparent communication network. S

S R

S

R

Resource nodes

R

S R

Router nodes

S

Datapath Network

S R

R

Network Interface Unit Control Network

CN S

S R

Bypass Network

S R

R

Fig. 1. S-mesh NoC architecture

2.1 Control Network and Datapath Network Control Network architecture is similar to Fat-tree architecture. Routers are located at the leaf node of the Fat-tree, and CN is located at the root node of the Fat-tree. CN unit is designed not only for efficient commands between resources, but also for efficient movement of operands between special resources just like processors or computing units. CN uses iSLIP algorithm [14] to schedule each active request and acknowledge in turn or forward command. All on-chip dynamic data movement uses packet-based and connection-oriented communication over DN. This includes memory accesses, user-level DMA transfers, and I/O. In S-mesh, the packets header does not contain

A Methodology for Design of Unbuffered Router Microarchitecture for S-Mesh NoC

445

routing information, only contains labels and payload. DN network is deadlock free and congestion-immunity due to connection-oriented communication mechanism. 2.2 Bypass Network Using BN to transfer messages between adjacent cores would be better way than using DN. It responses only to its four neighbors, not acts as a router. It can efficiently reduce the traffic load of DN and signaling load of CN. Moreover, Router has two logically disjointed networks, which could be implemented as two separate networks, or as two logical networks using the two groups of physical wires. When we map the internal communications between process cores upon two-dimensional mesh networks, about 72.9% communication data occurs between the adjacent cores (routing distance is 1) [15]. The obvious local data would be transmitted through the BN. However, the global data would be transmitted through the DN. This mapping rule can ensure that local and global communication in different region could be transmitted at the same time, thereby to reduce congestion and promote transmission efficiency of the whole system. The applicable condition is described in the pseudo code as below. if (abs(XDestination – Xsource) =1 or abs(YDestination – Ysource) =1) and (abs(XDestination – Xsource) • abs(YDestination – Ysource)) {Data transmission via BN port without ControlNetwork;} else {Data transmission via DN port with ControlNetwork;}

3 Router Microarchitecture The task of routers is to carry messages injected into the network to their final destination, following a defined determined routing path in advance. The router exchanges message flits from one of its input link to one or more of its output link under controlled by CN. Meanwhile, router can directly transfer message flits between adjacent resource nodes using BN port. The router microarchitecture consists (Fig.2) of DN crossbar, BN crossbar and controller. In S-mesh architecture, messages are divided into packets in NIU, and packets are further divided into flits. Every packet is only composed of a header flit, a tail flit and some data flits. The every flit width is equal to N+2. The “2” is used as a packet label to indicate which packet belongs to idle, header, tail or payload. The link bandwidth also can be configured in 18-bit, 36-bit, 72-bit and others according to service demands. This attribute can increase data transmission efficiency and utilization of links. Meanwhile, it can overcome the effect of best-effort services using packet switching. In 2D-mesh and Fat-tree NoC architectures, current routers do not predict any second hop router working condition. Therefore, routers need enough memory to buffer packets while the next routers might be stalled or under the condition of congesting possibility. Furthermore, local congestion is possible to cause global congestion, thus affect the whole system performance. Regarding the buffer in routers, it is obviously different in router microarchitecture comparison S-mesh and other NoC architectures. Different switching techniques are implemented in current NoC architectures except S-mesh, which has different performance metrics along with different requirements on hardware resources. Routers in

446

H. Liu et al.

Fig. 2. S-mesh router logical implementation model

S-mesh do not need to packed/unpacked and make decisions. This property makes the design of the switches simple, and the buffer is reduced or eliminated. The transmission latency of each router can therefore be reduced to a cycle. And, transmission and control of packet in S-mesh are separated. This indicates clearly that the router microarchitecture would not be changed due to routing algorithms running in CN. In computer networks, different techniques are used to perform message switching between different nodes. Popular switching techniques include Store-And-Forward (SAF), Virtual-Cut-Through (VCT) and WormHole (WH). When these switching techniques are implemented in NoC chips, they have different performance metrics along with different requirements on memory resources. The buffer requirement in various different routing models is shown in Table 1. Table 1. Buffer requirements and latency for different routing techniques

Table 1 indicates that router in S-mesh no longer need memory to buffer message flits. In SAF model [16], an entire packet should be received and stored prior to transmission to the next router. When the message size is big enough, it not only introduces extra delay at every router stage, but also requires a substantial amount of buffer to store multiple entire packets at the same time. VCT [17] requires the buffer for an entire packet. It is forwarded as soon as the next router guarantees that the entire packet will

A Methodology for Design of Unbuffered Router Microarchitecture for S-Mesh NoC

447

be accepted. However, when the next stage router is not available, the entire packet still needs to be stored in the buffers of the current router. A WH routing scheme can reduce routers memory requirements with low latency communication. If a certain flit faces a busy channel, subsequent flits have to wait at their current locations, and therefore they are spreaded over multiple routers. While packets block each other in a circular fashion such that no packets can advance, thus a deadlock is generated.

4 Simulation and Analysis We evaluated the S-mesh NoC by means of a Gpnocsim [18] simulator for NoC, an architectural level cycle-accurate simulator using Java. We measured message latency from the time a packet created in origination to the time the last flit arriving at destination message center. Nevertheless, in S-mesh architecture, the start time is the time that message request is sent to CN. Every simulation initiates a warm-up phase of 2 percentage of all number of running cycles. All types of the topologies use the WH switching technique except S-mesh. 4.1 Message Length In this scenario, we studied the relationships among latency and various message lengths. In this case, size of network is fixed at 16 nodes; width of flit is 64-bit. Average message latency was increased with message length in all NoC architectures. However, the average latency of S-mesh rises slowly by comparison [13]. When more and more message flits have been injected into network, contentions for resources in routers have become more serious. As a result, time spent in data buffer and queue process would be increased quickly in 2D-mesh and others. The utilization of each router should rise sharply. Especially, fewer routers are in overload operation when the packet length is up to 96 bytes. Then local congestion would be formed as time goes by. On the other hand, when routers’ output links have stalled, the routers need more memory to buffer the injecting data. Additionally, local router cannot predict following link and router running states of next hop, so congestion would be easily formed. And, the next hop router refuse the previous router ejection traffic because of resource exhausting. Further more a local congestion can quickly spread to a region or the entire network. Some NoC architectures provide a local best solution easily, but do not guarantee the global optimization performance. Because the datapath is pre-connected in S-mesh architecture, it indicates that the S-mesh would provide lower latency than other NoCs under bigger message application services. 4.2 Influence of Buffer Sizes Buffer is the major part of any network router. In the most NoC architectures, buffers occupy the main part of the router area. As such, it is a major concern to minimize the amount of buffer under given performance requirements. Moreover the influencing degree of buffer sizes on average network latency has also been studied. In this scenario, size of network is fixed at 16 nodes.

448

H. Liu et al. Table 2. The influence of input buffer size on network latency Flit/Buf

Fat-tree

2D-mesh

Torus mesh

Butterfly

Octagon

S-mesh

1 -> 2

12.3%

16.0%

15.5%

25.2%

15.9%

2.1%

2 -> 4

9.9%

12.4%

9.6%

6.6%

6.9%

0.5%

4 -> 8

6.0%

7.6%

10.1%

5.9%

4.8%

0.6%

8 ->16

3.74%

-0.52%

8.18%

-4.98%

14.58%

-0.16%

16->24

3.83%

4.16%

-0.15%

-1.12%

4.32%

-1.38%

Table 2 summarizes the propagation time reduction with various buffer sizes. Average network latency on five NoC architectures can be reduced as much as from 12.3% to 25.2% while buffer size changes from 1 to 2 flit/buffer. However, with the deepening of buffer size, the adverse impact on message latency is obvious. Such as 2D-mesh, Butterfly, and S-mesh architectures, the larger buffer sizes in routers wouldn’t help to promote the network performance while buffer sizes are changed from 8 flits to 16 flits. For further research, we gave impact degree of buffer sizes with different message length in three types of NoC architectures. For the large capacity of buffer sizes, it can efficiently reduce message latency in circumstance of large message packet as shown in Fig.3. Mainly because large buffer sizes reduce the network contention. But as shown in Fig.4, S-mesh architecture can get shorter network latency no matter what message is without buffer units. It is shown that increasing the buffer size is not a solution to avoid congestion. At best, it delays the onset of congestion since the throughput is not increased. That is, buffers are useful to absorb burst traffic, thus leveling the bursts. Moreover, the performance improves marginally in relation to the power and area overhead. Unbuffered router architecture can efficiently reduce buffer requirements without reducing network latency. It is great improvement because the cost of on-chip memory is much higher. 700

700

160-byte 96-byte 32-byte

160-byte 96-byte 32-byte

600

500

Avg. latency(clks)

Avg.latency(clks)

600

128-byte 64-byte

400 300 200

128-byte 64-byte

500 400 300 200

100

100

0

0 1

2

4

8

Buffer size in router (flits)

(a) 2D-mesh

16

24

1

2

4

8

Buffer size in router (flits)

(b) Fat-tree

Fig. 3. Relative curve of message length and buffer sizes

16

24

A Methodology for Design of Unbuffered Router Microarchitecture for S-Mesh NoC

60

160-byte 64-byte

128-byte 32-byte

449

96-byte 10-byte

Avg. latency(clks)

50

40

30

20

10 1

2

4

8

Buffer size in router (flits)

16

24

Fig. 4. Relative curve of message length and buffer sizes in S-mesh architecture

4.3 Area and Power Consumption Results Area figures were achieved with Design Complier logic synthesis tool targeting a Chartered 0.13µm CMOS High Performance (HP) technology. Power consumption was estimated by using the same tool that performs cycle-based simulations on the synthesized netlist. The comparison of the alternative NoC architectures for area and power consumption are summarized in Table 3. It should be noted that the area of router of S-mesh is smaller than the area of other architectures except routers in 3D mesh. Because of using Chartered HP technology, the power dissipation of the router is slightly higher. Though the same traffic is transmitted, it’s along with the optimal distances in S-mesh without more latency and no power is wasted on data buffer and congestion. These factors are all perfectly processed by Control Network. The operating frequency of Router in S-mesh is up to 1250 MHz, and achieves 200Gbps maximum bandwidth. Table 3. Area and power consumption results Router

Buffer

Equivalent area (mm2)

Power (mw)

1

3D mesh[19]

80 flits

0.0346

9.41@500M

2

S-mesh:2009

--

0.0411

7.39@200M

3

Reconfigurable [20]

Circuit-

0.051

n.a.

4

MaRS[21]

32-flits

0.052

4.47@432M

5

ReNoC [22]

40 flits

0.061

n.a.

6

Æthereal[23]

24-word

0.175

n.a.

7

Xpipes [24]

64b flits

0.19

n.a

8

GT-BE[25]

8 flits

0.26

n.a.

9

QNoC[26]

input buffer

0.314

n.a.

10

GALS [27]

n.a.

0.884

n.a.

450

H. Liu et al.

5 Conclusion S-mesh NoC architecture borrows excellent architectural features of packet switching network, and unbuffered router microarchitecture borrows excellent architectural characteristic of circuit switching network. Therefore, S-mesh can gain higher performance than 2D-mesh and other NoC architectures under the condition of long message packet. The Equivalent area of router architecture is only 0.0411mm2. And local network performances optimizing would interfere in overall network performances. The result shows that S-mesh architecture and unbuffered router architecture are feasible and effective. Acknowledgments. We are grateful to thank the anonymous reviewers for their useful comments and suggestions. This research is supported by the High Technology Research and Development Program of China (No. 2009AA01Z105), the Postdoctoral Science Foundation of China under Grant (No.20080440942 and 200902432) and the Ministry of Education–Intel Special Foundation for Information Technology (No. MOE-INTEL-08-05).

References 1. Dally, W.J., Towles, B.: Route Packets, Not Wires: On-Chip Interconnection Networks. In: Design Automation Conf., USA, pp. 683–689 (2001) 2. Jantsch, A., Tenhunen, H.: Networks on Chip, pp. 3–39. Kluwer Academic Publishers, Hingham (2003) 3. Pande, P.P., Grecu, C., Ivanov, A., Saleh, R.: Design of a Switch for Network on Chip applications. In: ISCAS, Bangkok, Thailand, vol. 5, pp. v217-v220 (2003) 4. Kumar, S., Jantsch, A., Soininen, J.-P., Forsell, M., Millberg, M., Tiensyrja, K.: A network on chip architecture and design methodology. In: Proceedings of IEEE Computer Society Annual Symposium on VLSI, Pittsburgh, USA, pp. 105–112 (2002) 5. Karim, F., Nguyen, A., Dey, S.: An interconnect architecture for networking System on Chips. IEEE Micro 22(5), 36–45 (2002) 6. Bononi, L., Concer, N.: Simulation and Analysis of Network on Chip Architectures: Ring, Spidergon and 2D Mesh. In: Design, Automation and Test in Europe, Munich, Germany, vol. 2, pp. 6–10 (2006) 7. Millberg, M., Nilsson, E., Thid, R., Kumar, S., Jantsch: The Nostrum backbone - a communication protocol stack for networks on chip. In: Proceedings 17th International Conference on VLSI Design, Mumbai, India, pp. 693–696 (2004) 8. Rijpkema, E., Goossens, K.: A router architecture for networks on silicon. In: Progress 2001, 2nd Workshop on Embedded Systems (2001) 9. Taylor, M.B., Kim, J., Miller, J.: The Raw microprocessor: a computational fabric for software circuits and general-purpose programs. IEEE Micro. 25(2), 25–35 (2002) 10. Dall’Osso, M., Biccari, G., Giovannini, L., Benini, L.: Xpipes: a latency insensitive parameterized network-on-chip architecture for multi-processor SoCs. In: Proc. ICCD, SanJose, USA, pp. 536–539 (2003) 11. Liuhao, Xuecheng, Z., Lixin, J., Meng, C., Kefeng, Z.: S-mesh: A mesh-based on-chip network with separation of control and transmission. The Journal of China Universities of Posts and Telecommunications 16(5), 86–92 (2009)

A Methodology for Design of Unbuffered Router Microarchitecture for S-Mesh NoC

451

12. ITU-T: Network element management information model for the Message Transfer Part (MTP). ITU-T, Rec. Q.751.1. International Telecommunication Union - Telecommunication Standardization Sector, Geneva (1995) 13. ITU-T 2001b: Signalling connection control part procedures. ITU-T, Rec. Q.714. International Telecommunication Union - Telecommunication Standardization Sector, Geneva (2001) 14. McKeown, N.: Fast Switched Backplane for a Gigabit Switched Router. Stanford University, Stanford (2008), http://tiny-tera.stanford.edu/~nickm/papers/ cisco_fasts_wp.pdf (accessed 03/23/07) 15. Yuan, T., Fan, X.Y., Jing, L.: Application specific network on-chip architecture. Computer Engineering and Applications 43(6), 88–91 (2007) (in Chinese) 16. Terry, T.Y.: On-chip multiprocessor communication network design and analysis. PhD thesis, Stanford University, USA (2003) 17. Benini, L., Bertozzi, D.: Network-on-chip architectures and design methods. In: IEEE Proc. Computers and Digital Techniques, vol. 152(6), pp. 261–272 (2005) 18. Hossain, H., Ahmed, M., Al-Nayeem, A., Islam, T.Z., Akbar, M.: GPNOCSIM – A General Purpose Simulator for Network-on-Chip. In: ICICT, Dhaka, Bangladesh, pp. 254–257 (2007) 19. Kim, J., Nicopoulos, C., Park, D., Reetuparna Das, R., Xie, Y., Vijaykrishnan, N., Mazin, S., Chita, R.: A Novel Dimensionally-Decomposed Router for On-Chip Communication in 3D Architectures. In: 34th International Symposium on Computer Architecture (ISCA 2007), San Diego, California, USA, pp. 138–149 (2007) 20. Wolkotte, P.T., Gerard, J.M., Smit, J.M.G., Rauwerda, G.K., Smit, L.T.: An energy-efficient reconfigurable circuit-switched NOC. In: Proceedings of 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005), Denver, Colorado, USA, pp. 155a-163a (2005) 21. Bahn, J.H., Lee, S.E., Bagherzadeh, N.: Design of a router for network-on-chip. International Journal of High Performances Systems Architecture 1(2), 98–105 (2007) 22. Stensgaard, M.B., Sparso, J.: ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology. In: 2nd ACM/IEEE International Symposium on Networks-on-Chip, Newcastle University, UK, pp. 55–64 (2008) 23. Dielissen, J., Rădulescu, A., Goossens, K., Rijpkema, E.: Concepts and implementation of the Philips network-on-chip. In: IP Based SoC Design 2003, Grenoble, France (2003) 24. Benini, L., Bertozzi, D.: Network-on-chip architectures and design methods. In: IEE Proceedings of Computers and Digital Techniques, vol. 152(2), pp. 261–272 (2005) 25. Rijpkema, E., Goossens, K.G.W., Rădulescu, A., Dielissen, J., Meerbergen, J., Wielage, P., Waterlander, E.: Trade Offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip. In: IEE Proceedings of Computers and Digital Techniques, vol. 150(5), pp. 294–302 (2003) 26. Bolotin, E., Cidon, I., Ginosar, R., Kolodny, A.: QNoC - QoS architecture and design process for Network on Chip. Journal of System Architecture 50, 105–128 (2004) 27. Zipf, P., Hinkelmann, H., Ashraf, A., Glesner, M.: A Switch Architecture and Signal Synchronization for GALS System-on-Chips. In: 17th Symposium on Integrated Circuits and Systems Design (SBCCI 2004), Pernambuco, Brazil, pp. 210–215 (2004)

A Worst Case Performance Model for TDM Virtual Circuit in NoCs Zhipeng Chen and Axel Jantsch Royal Institute of Technology(KTH), Sweden {zhipeng,axel}@kth.se

Abstract. In Network-on-Chip (NoC), Time-Division-Mutiplexing (TDM) Virtual Circuit (VC) is well recognized as being capable to provide guaranteed services in both latency and bandwidth. We propose a method of modeling TDM based VC by using Network Calculus. We derive a tight upper bound of end-to-end delay and buﬀer requirement for indivdual VC. The performance analysis using Latency-Rate server is also presented in comparsion with our Performance model for TDM Virtual Circuit in NoCs (Pemvin). We conducted experiments on comparing Pemvin to the Latency-Rate server model. Our experiment results show the improvement of Pemvin on tightening the upper bound of end-to-end delay and buﬀer requirement.

1

Introduction

Development of modern submicron technology results in increasing of gate number and cores on one chip. Billion gates and over hundred cores on one die is now possible. Thus, current bus-based system-on-chip architecture is no longer adequate for multicores due to wire delay. Network-on-chip has emerged as novel paradigm which oﬀers better on chip communication architecture. Network-onchip has the potential to solve the scalabilty problem. However due to the contention of shared link and buﬀer, on chip network creates unpredictalbe performance. To overcome this nondeterminism, various approches are proposed to achieve Quality Of Service (QOS). Time-Division-Mutiplexing(TDM) Virtual Circuit (VC) is one approach among those which has been proposed in [1] [7]. TDM VC is a connection-oriented communitcation service in which two or more VC packet streams share buﬀers and link-bandwidth in turn. The time domain is divided into time slots, in every slot a ﬁxed number of packets can be sent to the network. Each VC has its own dedicated slots to use shared resources. To dimension the worst case end-to-end delay and buﬀer requirement for TDM VC is an important probelm for three reasons. First, end-to-end delay is critical for implementing QOS, such as video stream or telephony service. Second, the buﬀer requirement is an important parameter while designing NoC implementations. Third, with the method that can dimension worst case endto-end delay and buﬀer requirement, researchers are able to make high level evaluation of the NoC system without simulation.

Corresponding author.

C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 452–461, 2010. c IFIP International Federation for Information Processing 2010

A Worst Case Performance Model for TDM Virtual Circuit in NoCs

453

In this paper we address the problem of performance analysis of individual TDM VC. Modeling of TDM VC makes it possible to have performance analysis of TDM VC in system synthesis. By knowing ﬂow characteristics and slot allocation, researchers can obtain the end-to-end delay and buﬀer requirement directly without simulation. Currently this problem is not deep investigated. TDM VC is treated as Latency-Rate server in [6], which in some cases yields a tight bound. However, TDM VC has various slot distribution style within the time window, Latency-Rate server on the other hand uses avergage severvice rate to determin the behavior. Thus, a more accurate model is needed to give out precise performance analysis. We propose a formal approach by taking each slot and its corresponding interval within the time window as an individual session (see section 5.2). The rest of the paper is structured as follows. Section 2 discusses the related work. Section 3 is an introduction of network calculus basic. We introduce the VC model of Latency-Rate server in Section 4. Section 5 introduces the formal model of TDM VC, Pemvin, for performance analysis. Section 6 shows the experimental results. Finally, Section 7 contains some conclusions and directions for future work.

2

Related Work

End-to-end delay and buﬀer requirement dimensioning is a general problem of performance analysis of TDM VC after knowing the VC speciﬁcation. Network calculus [4] is a theoretical framework for analysing performance guarantees in computer networks. The foundation of network calculus lies in the mathematical theory of dioids, and in particular, the Min-Plus dioid. It oﬀers us analytical mathmatical methods for buﬀer and delay dimensioning. Latency-Rate server [6] is a general model for analysis of traﬃc scheduling algorithms. The behavior of a Latency-Rate scheduler is determined by two parameters the latency T and the allocated rate R. It can be used to derive bound of end-to-end delay and buﬀer requirements in a network of servers. This paper considers modeling TDM VC into slots corresponding sessions. Each session can be modeled as Latency-Rate server as proposed in [6]. In [10], Lu and Jantsch proposed a simple model for evenly distributed time slots using network calculus. Pemvin analysis further on other occasions of VC slot allocation. Our results show that Pemvin has a signiﬁcant improvement of the upper delay bound. Our work focuses on end-to-end delay and buﬀer requirement dimensioning for TDM VC with given slot allocation. By comparing diﬀerent schemes of VC slot allocation, we found that evenly distributed slots yield the best performance. However, with diﬀerent number of VCs, it is not always possible to get evenly distributed slots. Thus, an approach to analysis performance of a given ﬂow and slot allocation is necessary to get the allocation ﬁt to the requirement of the ﬂow.

454

3 3.1

Z. Chen and A. Jantsch

Network Calculus Basic Arrive Curve and Tspec

In network calculus [4], the traﬃc sent by a source is bounded by an arrival curve. Given a wide-sense increasing function α deﬁned for t ≥ 0. We say that a ﬂow F is constrained by α if and only if for all s ≤ t: F(t) - F(s) ≤ α(t - s); F has α as an arrival curve.

Fig. 1. TSPEC

A ﬂow F(t) represents the accumulated number of bits transferred in the time interval [0, t]. We use TSPEC (Traﬃc SPECiﬁcation) to represent ﬂow characteristics. With TSPEC, F is characterized by an arrival curve α(t) = min(L + pt, σ + ρt) in which L is the maximum transfer size, p the peak rate (p ≥ ρ), σ the burstiness (σ ≥ L), and ρ the average rate. 3.2

Service Curve

Network calculus uses the concept of service curve to describe the minimum amount of service that is guaranteed to a ﬂow. Consider a system S and a ﬂow through S with input and output function F and F ∗ . We say that S oﬀers to the ﬂow a service curve β if and only if β is wide sense increasing, β (0) = 0 and F ∗ ≥ F ⊗ β, in which ⊗ is minplus convolution [4]. 3.3

Analytical Bound

With a ﬂow’s arrival curve and a network element’s service curve, we can decide the delay and buﬀersize bound in the network element using network calculus. ¯ is the maximum horizontal distance As illustrated in ﬁgure 2, the delay bound D ¯ the maximum vertical distance between the arrival and the buﬀer size bound B ¯ is either the distance marked as AA’ curve and the service curve. In ﬁgure 2, D ¯ is either the distance marked as AE or AE’. In network calculus, or BB’; B following equations are used to calculate delay bound and buﬀer size bound. ¯ = inf {D ≥ 0 such that α β(−d) ≤ 0} D

(1)

A Worst Case Performance Model for TDM Virtual Circuit in NoCs

455

¯ = (α β)(0) B

(2) ¯ ¯ where is minplus deconvolution [4]. In network calculus, D and B are represented as maximum vertical distance and maximum horizontal distance of arrival curve and survice curve.

Fig. 2. Computation of delay and buﬀer size bound

4

VC Modeling as Simple Latency-Rate Server

Function βR,T = R(t − T )+ describes the service model, where R is service rate and T is the maximum initial delay. Notation x+ = x if x > 0; x+ = 0, otherwise. Latency-Rate service curves as deﬁned by function βR,T = R(t − T )+ is illustrated in ﬁgure 3. The service curve βR,T = λR ⊗ δT , whereas ⊗ is deﬁned as: (λR ⊗ δT )(t) = inf {λR (t − s) + δT (s)} [6]. For a given ﬂow F(t) characterized by TSPEC arrival curve α(t) = min(L + pt, α + ρt), by knowing service rate R and initial Delay T, we have + ¯ = L + θ(p − R) + T, where θ = (σ − L) D R (p − ρ)

(3)

¯ = σ + ρT + (θ − T )+ [(p − R)+ − p + ρ], where θ = (σ − L) B (p − ρ)

(4)

TDM VC has an average service rate denoted by R = n/Tw , where n is the number of slots with in the time window, and Tw is the length of time window. There are many possiblities of TDM VC for the initial delay T. The worst case is that a packet misses a slot and have to wait entire interval between two slots. V(data volume)

V(data volume)

V(data volume)

R

R

t

t

t T

T

Fig. 3. Latency rate server

456

Z. Chen and A. Jantsch

We present a Latency-Rate server model here. In a TDM VC with n unevenly allocated slots, there are n intervals. Assume that the lengths of these n intervals are T1 , T2 , ..., Tn . The model uses the longest inteval length Ti = max(T1 , T2 , ..., Tn ) as initial delay. For TDM VC that has even slot allocation, this model is a close approximation. But for uneven slot allocation, the diﬀerence between sessional service rate and average rate could be signiﬁcant. The worst case performance can be too pessimistic. Thus, a more accurate method of modelling is needed for buﬀer and delay dimentioning.

5 5.1

Pemvin TDM VC Modeling TDM VC Modeling Example

An example is illustrated in ﬁgure 4. VC v goes through the network. Assume that there are 16 time slots in each time window. Four slots are assigned to v 4 possiblities of the admission patterns. every time window. There are totally C16 For example,there are 4 slots in the VC and all have diﬀerent length. From the beginning is 5,3,2,6 consecutively. Considering a TSPEC ﬂow (6.4,0.1,1,1), if we use average serving rate 4/16 and average initial delay 4, by applying equation 3 and 4, we can get worst case end-to-end delay and buﬀer size. But in reality, VC serves packet exactly at the time slot reserved, for each session, the service rate is 1/5,1/3,1/2 and 1/6 respectively. Every session’s rate is diﬀerent from the average rate. Thus, the model is not accurate. With diﬀerent permutation of the reserved slots, the worst case behavior will be diﬀerent for a given VC. As we can see from the ﬁgure 4, if the initial slot length is 5, the worst case delay is 19, and worst case buﬀer size is 5. When the initial slot is 6, the worst case delay is 24, and worst case buﬀer size is 6. The results from Latency-Rate server model, the worst case delay is 28, and worst case buﬀer size is 7.

Fig. 4. Example of uneven VC

5.2

Pemvin TDM VC Modeling

In a TDM VC, mutiple slots are allocated with a time window. Length of intervals between slots are diﬀerent if allocation is uneven. While serving the ﬂow, a ﬁnite length list of slots is repeated periodically. The intervals between slots may

A Worst Case Performance Model for TDM Virtual Circuit in NoCs

457

have same length(evenly distributed) or not(unevenly distributed). VC packets synchronously advance one hop per time slot. Pemvin models TDM VC with sessions. Deﬁnition 1. A session of a TDM VC is a reserved time slot and the inteval between this slot and next reserved time slot. For each session it has a session service rate Ri , which is given by function Ri = 1/Ti , where Ti is the length of interval.

Fig. 5. Service curve of TDM VC (si stands for a session) α is the arrive curve and β is the service curve

In Pemvin, each session considers to be an individual server. By concatenating the sessions together, we obtain:

βR1 ,T1 ∧ βR2 ,T2 ∧ βR3 ,T3 ∧ ... ∧ βRI ,TI = min1≤i≤I {βRi ,Ti } Ti = (Tj + ni ∗ (n − Rj Tw )) j = i − ni ∗ n, j ∈ [0, n − 1] in which βri ,Ti is session service curve. Ri is the service rate of the session service curve. Ti is the initial delay of the session service curve. j is the sequences number of the resevered slots. n is the number of the resevered slots of the time window. Rj is the average rate of j th session. Together with ﬂow characteristics and TDM VC slot allocation, we calculate Ri , Ti accordingly. Note that with diﬀerent initial starting slots, there will be diﬀerent service curves. With Pemvin, we obtain a new service curve, which taking into account of slot allocation of TDM VC. With network calculus, we use equation 1 and 2 to obtain upper bound of worst case end-to-end delay and buﬀer size. 5.3

Algorithm for Pemvin

To decide the starting slot of the TDM VC which yields worst case end-to-end delay and buﬀer size, Pemvin has following algorithm.

458

Z. Chen and A. Jantsch

Algorithm 1. End-to-end Delay and buﬀer requirement dimensioning of TDM VC Input: ﬂow characteristics F ∼ (σ, ρ, L, p), VC slots slots(s1 , s2 , ..., sn ), si is the time slot reserved. output: Worst case end-to-end Delay D, Buﬀer size B. Find if there is combinations of slots gives less service rate than ρ; θ = (σ − L)/(p − ρ) If no combination found,check permutation for the ﬁrst θ + 1 packets; If a combination exists, let this combination serves immediately after θ + 1 packets. Check if there is permuation gives lower service rate. If so, change to new permutation and use the current permutation to get the worst case delay and buﬀer size.

6 6.1

Simulation and Results Simulation Purpose and Setup

In order to assess the proposed algorithm, we build up a simulator for TDM VC according to ﬁgure 6. The simulator was developed in C under Linux. A 4 by 4 mesh network is constructed in the experiment. VCs are automatically generated. The bandwith of all links is 1 packet/cycle. We assume that all swithes have the same time window for TDM VC. The simulator will generate a TDM VC according to the characteristics of the TDM VC. Besides VC generator, the simulator also gives out the worst case delay and buﬀer size according to the ﬂow characteristics and VC characteristics.

Fig. 6. Flow of Pemvin simulation

A Worst Case Performance Model for TDM Virtual Circuit in NoCs

6.2

459

Exploration of Diﬀerent Flow and TDM VC Characteristics

To observe the accuracy diﬀerence between our model and latency-rate server model, we made experiments over diﬀerent ﬂow charcteristics and diﬀerent VC slot allocations. In the table 1,We show experiment results of 6 ﬂows. They are F1 ∼ (1, 1, 16.16, 0.015),F2 ∼ (1, 1, 14.11, 0.023),F3 ∼ (1, 1, 12.06, 0.031), F4 ∼ (1, 1, 6.16, 0.015), F5 ∼ (1, 1, 4.11, 0.023), F6 ∼ (1, 1, 2.06, 0.031) respectively. The ﬁrst 3 ﬂows have 20 packets injected into network for the ﬁrst 256 cycles, and the last 3 ﬂows have 10 packets injected into network within the 256 cycles. And for each ﬂow has average rate ρ, VCs with average rate ρ, 2ρ are assigned. Table 1. Results from Pemvin and Latency-Rate server model σ 16.16 16.16 14.11 14.11 12.06 12.06 6.16 6.16 4.11 4.11 2.06 2.06

ρ 0.015 0.015 0.023 0.023 0.031 0.031 0.015 0.015 0.023 0.023 0.031 0.031

N 4 8 6 12 8 16 4 8 6 12 8 16

R 0.015 0.031 0.023 0.047 0.031 0.063 0.016 0.031 0.023 0.047 0.031 0.063

Dlrs DP emvin 1125 1061 596 544 730 666 356 325 480 437 250 216 529 458 242 216 231 197 140 127 134 114 79 68

diﬀD 6.03% 9.56% 8.77% 8.71% 8.96% 13.6% 13.42% 10.74% 17.26% 10.24% 14.93% 16.18%

Blrs BP emvin 17.53 17 17.46 17 17.07 17 15.53 15 15 15 14 14 8.18 8 6.88 6 5.4 5 5.35 5 4.42 4 3.52 3

diﬀB 3.12% 2.71% 0.41% 3.41% 0% 0% 2.2% 14.67% 8% 7% 10.5% 17.33 %

All TDM VC slots from the table 1 are randomly generated. In the table, σ is the burstness of the ﬂow, ρ is the average service rate of the ﬂow, N is the number of the slots assigned to VC for this ﬂow, Dlrs and Blrs are the end-to-end delay and buﬀer requirement of the Latency-Rate server model, while DP emvin and BP emvin are the end-to-end delay and buﬀer requirement of Pemvin. dif fD and dif fB show the percentage diﬀerences of end-to-end delay and buﬀer requirement between Pemvin and Latency-Rate server model. We simulated with evenly distributed TDM VCs and consecutively allocated TDM VCs too. We can see from the ﬁgure 7, for the same ﬂow and same slots number, the evenly distributed TDM VCs yeild best performance. On the contrary the consectively allocated TDM VCs yield worst performance. Each bar in the chart is the average values of 1000 simulations. The blue column is the value that starts from a random slot, the red column is the value from Pemvin and the yellow column is the value from Latency-rate server model. Labels under the columns mark the slots allocating style. We use ﬂow F ∼ (1, 1, 10, 0.015) for simulation, 4 slots, 8 slots and 12 slots are assigned to the VC every 256 cycles. Also we can see from the chart in the diﬀerence in delay between Latency-Rate server model and Pemvin can go up to more than 30 percent. Thus, Pemvin

460

Z. Chen and A. Jantsch

Fig. 7. Simulation results:end-to-end delay

Fig. 8. Simulation results:buﬀer size

signiﬁcantly increase the accuracy of the delay bound. However the increase of accuracy on buﬀer size is not as good as on delay bound. Yet it is still tighter than Latency-Rate server model.

7

Conculusion

We proposed a TDM VC performce model Pemvin for end-to-end delay and buﬀer requirement dimensioning in this paper. A TDM VC is divided and modeled as sessions. By exploring diﬀerent slot allocation schemes using Pemvin, we found that evenly distributed slots give out best worst case end-to-end delay and buﬀer requirement. In comparison with Latency-Rate server model, Pemvin yields signiﬁcantly tighter bounds on worst case delay and tighter bounds on worst case buﬀer requirement. Our future work will develop performance models for other VC schemes other than TDM VC. We will use results of Pemvin as constraints for TDM VC allocation. Our invetigestion will also extend to other switching schemes such as deﬂection ﬂow control.

A Worst Case Performance Model for TDM Virtual Circuit in NoCs

461

References 1. Goossens, K., Dielissen, J., Radulescu, A.: The Æthereal network on chip: Concepts, architectures, and implementations. IEEE Design and Test of Computers 22(5), 21–31 (2005) 2. Wang, Y., Zhou, K., Lu, Z., Yang, H.: Dynamic TDM Virtual Circuit Implementation for NoCs. In: Proceedings of Asia-Paciﬁc Conference on Circuits and Systems (APCCAS 2008), IEEE Conferences, China (2008) 3. Lu, Z., Jantsch, A.: Slot Allocation for TDM Virtual-Circuit Conﬁguration for Network-on-Chip. In: Proceedings of the 2007 International Conference on Computer-Aided Design (ICCAD 2007), IEEE Conferences, USA (2007) 4. Boudec, J.Y., Thiran, P.: Network Calculus. LNCS, vol. 2050. Springer, Heidelberg (2001) 5. Tong, L., Lu, Z., Zhang, H.: Exploration of Slot Allocation for On-Chip TDM Virtual Circuits. In: The 12th EUROMICRO Conference on Digital System Design (DSD 2009), IEEE Conferences, Greece (2009) 6. Stiliadis, D., Varma, A.: Latency-Rate Servers: A General Model for Analysis of Traﬃc Scheduling Algorithms. Computer Engineering Department University of California Santa Cruz (1995) 7. Millberg, M., Nilsson, E., Thid, R., Jantsch, A.: Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip. In: Proceedings of the Design Automation and Test in Europe Conference, IEEE Conferences, France, vol. 2, pp. 890–895 (2004) 8. Coenen, M., Murali, S., Radulescu, A., Goossens, K., Micheli, G.: A buﬀer-sizing algorithm for networks on chip using TDMA and creditbased end-to-end ﬂow control. In: Proceedings of International Conference on Hardware/Software Codesign and System Synthesis, Korea (2006) 9. Lu, Z., Jantsch, A.: TDM virtual-circuit conﬁguration for Network-on-chip. In: Proceedings of the 2007 International Conference on Computer-Aided Design (ICCAD 2007), IEEE Conferences, USA (2007) 10. Lu, Z., Brachos, D., Jantsch, A.: A Flow Regulator for On-Chip Communication. In: The 22nd IEEE International SoC Conference (SoCC 2009), IEEE Conferences, UK (2009)

Convex-Based DOR Routing for Virtualization of NoC Guang Sun, Yuanyuan Zhang, Yong Li, Li Su, Depeng Jin, and Lieguang Zeng Tsinghua University, Main Building 901c, Beijing, China [email protected]

Abstract. Network on Chip (NoC) is proposed as a promising intra-chip communication infrastructure. A simple and efficient routing scheme is important for large scale NoC to provide the required communication performance to applications with low area and power overheads. Although mesh is preferred for NoC, virtualization may lead to irregular topologies. In this paper, we propose a Convex-Based DOR (CBDOR) routing scheme for the convex topologies. We demonstrate the connectedness and deadlock-freedom of CBDOR. This routing mechanism relies only on two bits per switch. Simulation results show that the area overhead of CBDOR switch is just 2.2% higher than that of traditional DOR switch, with the added complexity negligible. Therefore, the simplicity in the routing mechanism and switch architecture makes CBDOR more practical and scalable when compared to LBDR and FDOR. Keywords: Network on Chip (NoC); virtualization; irregular topology; Dimension-Order-Routing (DOR); connectedness; deadlock-free.

1 Introduction With the development of the semiconductor technology, SoC (System-on-Chip) design is entering billion-transistor era [1], [2]. More and more IPs and smart interconnects are integrated in one chip. In order to alleviate the complex communication problems which arise as the number of on-chip components increases, Network on Chip (NoC) architecture has been recently proposed as a promising communication paradigm to replace global interconnects [3], [4]. NoC addresses the on-chip communication problem with a networking approach and provides notable improvements in terms of area, performance, scalability, reliability and flexibility over the traditional bus-based structures [5], [6], [7]. The topology and routing algorithm are the two important features that influence the network performance, cost and power consumption [8]. The Regular topology, especially the 2D mesh topology, becomes a kind of popular architecture for NoC design, for it is very simple and efficient from a layout perspective [9], [10]. For the traditional routing algorithms, logic-based routing algorithm (e.g. DOR) is preferred as it can reduce latency, power and area requirements [9]. However, most of them are used in regular topologies. C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 462–469, 2010. © IFIP International Federation for Information Processing 2010

Convex-Based DOR Routing for Virtualization of NoC

463

In order to fully exploit the increasing number of cores and get enough parallelism for applications, virtualization for multicore chips is becoming necessary [8], [10], [11], [12]. The virtualized NoC solution provides several advantages such as increasing resource utilization, reducing power consumption and increasing the yield of chips [11]. Although the concept of virtualization is not new (e.g. virtual memory, virtual machines), there are some challenges when applying it to NoC. A virtualized NoC may be viewed as a network that partitions itself into several different regions, with each region serving different applications and traffic flows concurrently [11]. In a virtualized NoC, the system should guarantee traffic isolation among regions. Therefore, virtualization for NoC may lead to irregular sub-networks within the original 2D mesh [9]. Fig. 1 shows an irregular topology resulting from virtualization of NoC.

Fig. 1. An irregular sub-network topology resulting from virtualization of NoC

According to the aforementioned motivations, it would be very necessary to develop an efficient routing algorithm for most practical irregular topologies. In this paper, we propose a Convex-Based DOR (CBDOR) routing scheme for most of the practical topologies that we might find in the near future in NoC. The CBDOR is a logic-based routing algorithm without the need of forwarding tables in the switches. Moreover, CBDOR is connected and deadlock-free. This routing scheme uses only two bits per switch to route in the convex topologies with minimal hops. Therefore, CBDOR is very simple. Simulation results show that the area overhead of CBDOR switch is just 2.2% higher than that of traditional DOR switch. Therefore, the simplicity in the routing mechanism and switch architecture makes CBDOR very practical and scalable. The remainder of paper is organized as follows. Section 2 presents some related work. Then, Section 3 describes the CBDOR mechanism in detail. Section 4 analyses the connectedness and deadlock-freedom of CBDOR, and provides some evaluations. Finally, in Section 5, we conclude our work.

464

G. Sun et al.

2 Related Work Traditional DOR is very simple and efficient, but it can only be used in regular 2D mesh. Traditional routings used in irregular topology mainly include source-based routing and table-based routing [9]. In source-based routing, the entire path is stored in the packet header [9]. Because the packet header itself must be transmitted during routing, it consumes network bandwidth, especially when the routing path is very long or the packet has only a few bits. In table-based routing, a table at each switch stores the output port that must be used for each destination [9]. The main advantage of table-based routing is that it can be used in any topology and with any routing algorithm [9]. However, as the size of NoC increasing, the memory requirements for building such routing tables also increase, thus exhibiting longer access delay and consuming more areas and power overheads [8]. LBDR [9] is a logic-based routing without the need of forwarding tables in the switches. All the supported topologies share the same property: all the end-nodes can communicate with the rest of nodes through the minimal path defined in the original mesh topology, namely, convex property. However, LBDR need 12 bits per switch and some extra logic gates. Moreover, the routing computing is very complex. FDOR [12] is also a logic-based routing, but the supported topologies only include the FDOR-topology, which must satisfy three requirements. First, it must consist of three meshes: the core, the X- flank and X+ flank. Second, the core mesh must have at least the same number of dimensions as any flank. Third, the border of flank is only a sub-border of the core mesh. Thus, the supported topologies in FDOR are just a very small part of the convex topologies. In this paper, we propose CBDOR, which is a logic-based routing algorithm without the need of forwarding tables in the switches. The CBDOR uses only two bits per switch to routing in the convex topologies with minimal hops. So CBDOR is very simple and efficient. In addition, when the shape and the size of convex topology are varied, LBDR [9] need to regenerate the routing bits and run routing computing again, which is very complex. However, the CBDOR scheme directly routes according to the two connectivity bits per switch, without any more operations. Thus, this simplicity in the routing mechanism makes CBDOR more practical and scalable.

3 Proposed Routing Algorithm: CBDOR This section describes the principles of CBDOR. Firstly, we define the topology conditions on which CBDOR can be used. All the supported topologies in the CBDOR must be convex. It means that the topology region must contain all the line segments connecting any pair of points inside it. Intuitively, the supported topologies can not have holes inside the region. Therefore, in the CBDOR scheme, all the end nodes can communicate with the rest of nodes through a minimal path defined in the original 2D mesh topology (pictured in Fig.2 (a)). Fig. 2 (a-c) shows some examples of topologies supported by CBDOR. As

Convex-Based DOR Routing for Virtualization of NoC

465

a counter example, Fig.2 (d) shows a topology with a hole inside the region. In this case, this topology is not convex. So the CBDOR can not be applied in this topology.

(a) 2D mesh

(c) “+” topology

(b) P topology

(d) non-convex topology

Fig. 2. Examples of topologies (a-c) supported and (d) not supported by CBDOR

Implementation of CBDOR is very simple due to the fact that it is a pure logicbased routing scheme. According to the two connectivity bits per switch, the CBDOR can change the order in which the different dimensions are routed, so the routing logic in the CBDOR is slightly more complex than the one in the DOR. We verify this in the next section by synthesizing the two switches. Simulation results show that the added complexity of the CBDOR is negligible. When irregular topologies are generated, the one connectivity bit per output port can be given by global manager (GM), which is responsible for system resource management [13]. The connectivity bits indicate whether a switch is connected with its neighbors [9]. For example, if the switch is not connected through the south port, Cs equals zero. Otherwise Cs equals one. Thus, the connectivity bits are Cs , Cn ,

Ce and Cw . In our scheme, we only use Cs and Cn to route. We assume that the X and Y coordinates of the final destination are stored in the message header ( X des and Ydes ), and each switch knows it X and Y coordinates (through

X cur and Ycur registers at each switch) [9]. The coordinates of the switch which is in the left and bottom corner of the 2D-mesh are smallest. Messages are routed with CBDOR scheme from current switch to destination switch, according to the offsets of coordinates and the two connectivity bits per switch. The CBDOR scheme is described in detail as below.

466

G. Sun et al.

The pseudo-code of the CBDOR scheme:

if Ydes < Ycur and Cs =1 then output_port= south; elseif Ydes > Ycur and Cn =1 then output_port= north; elseif X des > X cur then output_port= east; elseif X des < X cur then output_port= west; else output_port= local_node; end

4 Analyses and Evaluations In this section, we analyze the connectedness among all the nodes and the deadlockfreedom in the CBDOR scheme. In order to evaluate the complexity of the CBDOR scheme, we provide the synthesis results for the switch architecture of the CBDOR and compared them with DOR, which is very simple in routing logic and economical in area overheads. 4.1 Connectedness The connectedness of a routing scheme means that any pair of nodes in the supported topologies can communicate with each other by the routing scheme. All the supported topologies in the CBDOR must be convex. If the pair of nodes can be included in a rectangle topology, which is a part of original irregular topology, they can communicate with each other by the scheme just as DOR. Thus, in this case, the connectedness of the CBDOR is satisfied. If the pair of nodes can not be included in a rectangle topology, the connectedness of the CBDOR is satisfied through change the order in which the different dimensions are routed. For example, the destination switch is in the northeast of the source switch (as showed in Fig. 3). Before Ydes equals Ycur , the CBDOR scheme in the intermediate switches can be divided into two cases. First, messages are routed in the north direction when Cn equals one. Second, messages are routed in the east direction when Cn equals zero. Because of the convex property of the supported topology, this scheme can make sure that the messages can be routed towards destination switch until Ydes equals Ycur . After Ydes equals Ycur , the messages are routed in east direction until

X des equals X cur . Thus, in this case, the connectedness of the CBDOR is satisfied.

Convex-Based DOR Routing for Virtualization of NoC

467

Similarly, if the destination switch is in other position of the source switch, the connectedness of the CBDOR is also satisfied.

Fig. 3. The routing path from source switch to its northeast destination switch with the CBDOR scheme

From aforementioned analyses, we prove the connectedness of the CBDOR scheme. Moreover, because of the convex property of the supported topology, the CBDOR can route with minimal hops, which equals the Manhattan distance between the source and destination switches. 4.2 Deadlock-Freedom A routing scheme for the supported topology is deadlock-free if there are no cycles in its channel dependency graph. Fig. 4 shows the channel dependency in YX routing. The turns marked with dotted lines are forbidden and the turns marked with solid lines are permissible.

Fig. 4. The channel dependency in YX routing. The turns marked with dotted lines are forbidden and the turns marked with solid lines are permissible.

From Fig. 4, we can find WS, ES, EN and WN1 turns are forbidden in YX routing. Although it is possible that WS, ES, EN and WN turns happen in the CBDOR routing, the CBDOR is deadlock-free. We take the EN turn as an example to demonstrate that there are no cycles in its channel dependency graph. If EN turn happens in a

1

S, W, N and E represent south, west, north and east respectively.

468

G. Sun et al.

switch, we can deduce that the connectivity bit

Cn of its west neighbor switch equals

zero. For example, in Fig.3, switch B may have EN turn, and the connectivity bit Cn of its west neighbor switch, namely the Cn of switch A, equals zero. This is to say, if a switch permits the EN turn, the north port of its west neighbor switch is unavailable. Because of the convex property of the supported topology, the EN turn can not form cycles in its channel dependency graph. Similarly, we can demonstrate that the WS, ES and WN turns in the CBDOR can not form cycles in its channel dependency graph. Therefore, the CBDOR scheme is deadlock-free. 4.3 Evaluation and Results In this section, we implement the synthesis for the switch architecture of CBDOR, and compared them with tradition DOR. In order to compare the area overheads of switch architectures using different routing schemes, VHDL language is used to design our performance simulation platform because it is believed that simulation platform designed by hardware description language is more similar to realistic on-chip network. It is believed that the DOR scheme is very simple in routing logic and economical in area overhead. In order to evaluate the complexity of the CBDOR scheme, we synthesize the switch architectures for the CBDOR and traditional DOR with the VHDL-based platform. In the simulation, both schemes use wormhole switch technique. The synthesized results are given in Table 1. In Table 1, the unit of synthesized area is ALUT (adaptive look-up table). From Table 1, we can see that the area overhead of the CBDOR switch is just 2.2% higher than that of traditional DOR switch. Thus, the added complexity of the CBDOR is negligible. Regarding the networkthroughput performance of the CBDOR, this scheme is slightly worse than DOR because it aggregates some traffic towards the edges of the topology. Table 1. Synthesized results of switches using the DOR and the CBDOR

DOR switch CBDOR switch Added

Area Overhead 6975 ALUT 7131 ALUT 2.2%

5 Conclusion In order to fully exploit the increasing number of cores and get enough parallelism for applications, it is necessary to allow for partitioning the whole interconnected network into several separate regions. However, virtualization for NoC may lead to irregular topologies. So we propose a Convex-Based DOR (CBDOR) routing scheme for most of the practical topologies that we might find in the near future in NoC. We demonstrate the connectedness and deadlock-freedom of CBDOR. This routing scheme is a pure logic-based routing and uses only two bits per switch to route in the convex topologies with minimal hops. Therefore, CBDOR is very simple. Simulation

Convex-Based DOR Routing for Virtualization of NoC

469

results show that the area overhead of CBDOR switch is just 2.2% higher than that of traditional DOR switch, with the added complexity negligible. Therefore, the simplicity in the routing mechanism and switch architecture makes CBDOR more practical and scalable when compared to LBDR and FDOR. Acknowledgments. This work is partly supported by National Natural Science Fund (NNSF-90607009), partly supported by the National High Technology Research and Development Program (No. 2008AA01Z107) and partly supported by the National Basic Research Program (No. 2007CB310701).

References 1. Hu, J., Marculescu, R.: Energy-aware mapping for tile-based NoC architectures under performance constraints. In: ASP-DAC 2003, pp. 233–239. IEEE Press, Kitakyushu (2003) 2. Benini, L., Micheli, G.D.: Networks on chips: a new SoC paradigm. Computer 35, 70–78 (2002) 3. Marculescu, R., Ogras, U., Peh, L.S., Jerger, N., Hoskote, Y.: Outstanding research problems in NoC design: system, microarchitecture, and circuit perspectives. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 28, 3–21 (2009) 4. Kreutzand, M., Marcon, C.L., Carroand, N., Susin, A.: Energy and latency evaluation of NoC topologies. In: ISCAS 2005, pp. 5866–5869. IEEE Press, Kobe (2005) 5. Lee, H.G., Chang, N., Ogras, U.Y., Marculescu, R.: On-chip communication architecture exploration: a quantitative evaluation of point-to-point, bus, and network-on-chip approaches. ACM Transactions on Design Automation of Electronic Systems 12, 20–40 (2007) 6. Dally, W., Towles, B.: Route packets, not wires: On-chip interconnection networks. In: DAC 2001, pp. 684–689. IEEE Press, Las Vegas (2001) 7. Horowitz, M., Ho, R., Mai, K.: The future of wires. Proceedings of the IEEE 89, 490–504 (2001) 8. Mejia, A., Palesi, M., Flich, J., Kumar, S., Lopez, P., Holsmark, R., Duato, J.: RegionBased Routing: A Mechanism to Support Efficient Routing Algorithms in NoCs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17, 356–369 (2009) 9. Flich, J., Rodrigo, S., Duato, J.: An efficient implementation of distributed routing algorithms for nocs. In: The Second ACM/IEEE International Symposium on Networks-onChip, pp. 87–96. IEEE Press, Newcastle (2008) 10. Gratz, P., Kim, C., McDonald, R., Keckler, S.W., Burger, D.C.: Implementation and evaluation of on-chip network architectures. In: International Conference on Computer Design 2006, pp. 477–484. IEEE Press, Las Vegas (2006) 11. Flich, J., Rodrigo, S., Duato, J., Sodring, T., Solheim, A.G., Skeie, T., Lysne, O.: On the Potential of NoC Virtualization for Multicore Chips. In: CISIS 2008, pp. 165–177. IEEE Press, Washington (2008) 12. Skeie, T., Sem-Jacobsen, F.O., Rodrigo, S., Flich, J., Bertozzi, D., Medardoni, S.: Flexible DOR routing for virtualization of multicore chips. In: International Symposium on Systemon-Chip 2009, pp. 73–76. IEEE Press, Tampere (2009) 13. Chou, C.L., Ogras, U.Y., Marculescu, R.: Energy-and performance-aware incremental mapping for networks on chip with multiple voltage levels. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 27, 1866–1879 (2008)

MPSoC Architecture-Aware Automatic NoC Topology Design Rachid Dafali and Jean-Philippe Diguet European University of Brittany - UBS/CNRS/Lab-STICC dept. BP 92116, F-56321 Lorient Cedex, France [email protected]

Abstract. This paper presents a methodology for the automatic definition of NoC topology according to application and architecture requirements. The proposed solution, which has been implemented as a new step of our NoC design flow, results from the analysis of real concerns and demands from designers. The main contribution lies in the fact that we customize the method according to multiprocessor architecture models and associated memory organizations. A real-life H264 example has been used to compare synthesis results for NoCs generated by our tool, with automatic topology selection with well-known efficient topologies, which wer manually specified. Results clearly show the efficiency of our approach. Keywords: Application Specific Network, Network-on-Chip, Topology Generation, Mapping.

1 Introduction Networks-on-Chip (NoC) have recently emerged as a new generation of communication infrastructure to support communication on SoCs providing a solution to connect different IP-cores through an effective, modular, and scalable communication network. The choice of the most suitable network topology for a particular set of applications is an important phase in designing an efficient NoC. Wrong topologies choices can dramatically affect network performances. Consequently, the topology not only impacts the efficiency of the network in terms of bandwidth and speed, but also has a major influence on the area of the network and its energy consumption. Area cost is one of the most frequently cited drawbacks addressed to NoC solutions and the purpose of this paper is precisely to describe a new methodology that aims to minimize the final NoC area by means of a relevant topology selection. This design step is implemented within our existing NoC CAD tool in order to choose a network topology according to application constraints and architecture model. This methodology is based on the heterogeneity of application domains and exploits the designer’s knowledge on the model of his architecture including processors and memory organization to guide the topology selection. This aspect is usually ignored although it is of great importance. Actually, in real-life conditions the architecture model usually exists before the NoC has been designed for two major reasons. First, it is directly related to the application domain and secondly it depends on previous design C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 470–480, 2010. c IFIP International Federation for Information Processing 2010

MPSoC Architecture-Aware Automatic NoC Topology Design

471

reuse. As a result, our method depends on the initial architecture model and then aims to minimize the final cost by topology refinements according to application constraints. This paper is organized as follows: In section 2, a survey of the related work is done. The methodology and the design are explained in sections 3 and 4, section 5 describes results obtained with a real-life video encoder. Finally, section 6 concludes the paper and gives an overview of our future work.

2 State of the Art To integrate and automate the choice of topology in the NoC design flow, current methods [11, 9, 7] select a generic topology among a predefined set. This approach is based on the placement of IPs on various topologies, followed by a comparison of performances to identify the best topology. The relevancy of this approach depends on many factors, such as the criteria for comparison, the number and the models of topologies compared, and the placement algorithm used. The Stanford solution presented in [11] consists of a tool called SUNMAP, which is used for the automatic selection of the best topology and the NoC code generation. This tool places the IP blocks on several standard topologies (2D mesh, Torus, Hypercube, Butterfly, and Stage Clos); it then selects the best architecture by estimating model performances in terms of bandwidth and area. The mapping is performed through a heuristic approach formulated in [10], with an objective function, the goal of which is to minimize average communication time, as well as surface and energy consumption, while meeting bandwidth constraints. Similarly, the simulation tool SIMOO from University of Rio Grande do Sul automatically chooses a topology among several by analyzing the communication of a given application [9]. This tool first extracts the application communication behaviour as a dynamic communication graph from high-level communication primitives and simulation results. Then, it maps this graph into different previously characterized architecture templates. Finally, it analyses and selects the architecture that best fits the application communication performance constraints. Moreover, a similar approach is presented in [7]. This paper proposes a methodology that aims at minimizing the power consumption by selecting the optimum network topology for a specific application. Another method for designing the best network topology is based on a hierarchical partitioning [12,3]. In this approach, the network is divided into two or more partitions and each partition is modeled by a dedicated topology. The different partitions are then combined to form a single network topology. The efficiency of this approach depends on the choice of model for a given partition and the choice of the algorithm partitioning. The approach in [12] customizes a specific topology for a given application by decomposing the communication requirements into a set of generic communication primitives, such as gossiping and broadcasting. Each primitive is swapped by a specific representation graph in the library. This graph structure is the pattern that the decomposition algorithm searches for when processing the input application graph. After the decomposition step, the communication primitives are replaced by their optimal implementations and finally glued together to synthesize the customized architecture.

472

R. Dafali and J.-P. Diguet

The personalization of the network topology for a given application remains the best option to achieve the highest degree of flexibility in the NoC paradigm. This customization can be achieved by exploiting the specific communication application to design the network that meets the designer’s requirements for performance, surface and energy consumption. Our view differs from previous works, as they are directly guided by the architecture model and strongly focalized on the real design bottleneck that motivates the bandwidth requirement and consequently the need of a NoC solution, namely the concurrent memory accesses. However, our methodology is adapted to the way local, shared, and global memories are accessed. Thus, we consider that the NoC topology design flow must be based on two strongly related main stages. The first one starts with the exploration and analysis of application domain requirements. It results in Communication Dependency Graphs (CDG) with bandwidth requirements. Our tool already performs these tasks. The second one defined as an interactive step, which is necessary to integrate designer knowledges and experience. This is the specification of the architecture model in terms of memory accesses and address computation mechanisms. Area optimization is then performed along with topology design steps including clustering of non communicating IPs, load balancing and IP placement. It results in a topology that fits application and architecture requirements. This second stage is the main contribution presented in this paper.

3 Design Methodology The heterogeneity of application domains in embedded system-on-chip makes it impossible to develop a single method for NoC topology design because the varying nature of communications means various different schemes from one domain to another. The aim of our approach is to exploit the awareness of the design about the architecture model to guide the topology selection. Consequently, we propose a new method that extracts the application communication behaviour from high-level communication primitives, analyzes and identifies the class of multiprocessor architecture and designs the topology with an automatic method customized for the class that has been chosen. This section describes the various multiprocessor architecture classes identified and explains the topology design method developed for each one. 3.1 Multiprocessor Architecture Class Identification To classify the multiprocessor architecture, we use Raina taxonomy [5] that extends Flynn’s [4] classification by considering that multiprocessor architectures are based on two elemental features: the address space and the physical memory organization. These are the basic properties that represent a memory system and its relations with processors. Raina completed this terminology by designing the address space to be either shared or disjoint and the physical memory organization to be either shared or distributed. Therefore, this taxonomy organizes memory systems into three different classes described in Fig 1. The first class is shared address-space shared memory (SASM), which corresponds to the conventional shared memory systems. The second

MPSoC Architecture-Aware Automatic NoC Topology Design

473

class is disjoint address-space distributed memory (DADM), which corresponds to the conventional distributed memory systems. The third class is shared address-space and distributed memory (SADM). We have chosen this terminology primarily because the the memory’s access policy has a major impact on the NoC performance. Secondly, it is based on criteria that differentiate application fields. For example, the architecture of a video coding application fits the SASM class because all processors communicate with shared memories through the network. However, the architecture of a turbo-communication application belongs to the DADM class because the messages circulate between processors and the local memories are not connected to the network. The choice of the architecture model and the component specification is made by the designer during a first interactive step implemented in the µSpider II CAD tool. Then, the associated method is applied according to the designer’s choice. These methods are detailed in the following subsections.

IP1

IP2

…...

IPn

…...

Local memory1 IP1

NoC

Shared memory1

DADM (NORMA)

SADM (NUMA)

SASM (UMA)

Shared memorym

Local memorym IPn IP……... 3

IP2

NoC

Local memory1 IP1

Shared memory1

Local memoryn IP2

IP3 ……...

IPn

NoC

Fig. 1. Raina multiprocessor architecture classification

3.2 SASM Topology Design Method The SASM class is based on the Uniform Memory Access (UMA) model. In UMA architecture, all the processors share the physical memory uniformly. Usually, the SASM architecture is used to speed up the execution of a single large program in time critical applications. The Algorithm 1 describes the method used to create a dedicated topology for a SASM class, this algorithm includes three steps: 1. First, choose the structure to map the shared memories (SMs). Thus, if the SMs are less than or equal to three, they are connected with point-to-point links. Otherwise they are mapped onto a mesh 2D topology. The SMs location on the mesh 2D topology is guided according to the number of common SMs interlocutors and the communication bandwidth required for these common interlocutors. For example, in Fig.2, a SM4 has a PU5 as a common interlocutor with SM1 and SM3 and PU5 communications require more bandwidth with the SM1 , this implies that the SM4 is placed closer to the SM1 .

474

R. Dafali and J.-P. Diguet

Algorithm 1. Topology Design for SASM class First step: Shared memories (SM) mapping. if Number of SM ≤ 3 then SM are connected by point-to-point links. else SM are mapped on mesh 2D topology. end if Second step: Processing Unit (PU) mapping. if PUx communicates with SMy then Map the PUx on SMy star topology. end if Third step: Each PU must belong to only one star-group repeat Selection of PUx which demands the smallest bandwith. Computation of the PUx bandwith impact (Eq. 1) on each star-group. Add the PUx . to the star-group which incurs the minimum impact. until all PU are not connected

2. In the second step, place the processing units (PUs) onto the topology. The strategy used consists in attaching to each shared memory node a star topology that connects the PUs communicating with this shared memory. This implies that a PUx can belong to several star-groups. As illustrated in Fig.2, PU5 belongs to three stargroups: SM1 , SM3 and SM4 . we choose the star structure bacause is scalable, easy to set up and to expand. 3. Finally, to address the problem of the PUx belonging to several star groups, compute the PUx ’s bandwidth impact on the star-groups and then keep the PUx in the group where it introduces the minimum amount of impact on. The PU’s bandwidth impact is calculated using this formula: ImpactPUxG =

PU (∑Ni=0 BWPUi ∗ δi(G) ) − BWPUxG PU BWPUi ∗ δi(G) ∑i=0

N

∗ 100

(1)

where: – ImpactPUxG is the impact of PUx attached to G group. – NPU is the total number of PUs. – BWPUi is the bandwidth required between the PU and the SM belonging to the star-group. 1 i f PUi ∈ G – δi(G) = 0 otherwise Consequently, in the example shown in Fig.2, the PU5 is placed definitively in the SM1 star-group because it incurs the minimum bandwidth impact compared to the impact in the other star-groups.

MPSoC Architecture-Aware Automatic NoC Topology Design

PU7

PU1 Shared memory1

Router PU5

Shared memory3

Router

Router

PU5 impact = 78,03%

475

PU2 Router PU5

PU3

PU5 impact = 82,28%

Mesh2D PU6

Router

Router PU5

Shared memory4 PU3

PU5 impact = 91,06%

Router

Shared memory2

PU8

PU4

Router PU3

Star

Fig. 2. SASM design topology method

3.3 SADM Topology Design Method The SADM class is based on a Non-Uniform Memory Access (NUMA) model. NUMA model is a memory design used in multiprocessor architecture, where the processor can access its own local memory faster than non-local memory. The non-local memory can be a local memory to another processor or a shared memory. To automatically generate a dedicated topology for the SADM class, we have also implemented a specific algorithm similar to the approach used for the SASM class. However, the difference is located in the third step of the algorithm. Where the processors are connected to the star-groups of their local memories, even if they have the higher impact on these star-groups. Thus, we keep the specification provided by the NUMA model, while guaranteeing that the processor can access its own local memory rapidly. 3.4 DADM Topology Design Method The memory access in the DADM class is based on the No Remote Memory Access (NORMA) model. NORMA is a memory organization used in multiprocessor architectures with distributed memories, where a processor cannot access non-local memories. To deal with this model, we use an algorithm composed of three steps to design a dedicated topology for the DADM class: 1. First step: create non-communication PU groups. As shown in Fig.3, the processors are grouped in 4 clusters represented by different colors and in each group the processors do not communicate. 2. Secondly, structure all non-communication PU groups in pairs; a pair links two nodes of a group through a router; a node represents a processor or another pair. The arrangement of the pairs are created according to the communication bandwidth required for the common interlocutors of the pair elements. In Fig.3, we illustrate the pair arrangement in the example by structuring the group represented by the yellow color. 3. Finally, connect the top router of each group on mesh 2D topology.

476

R. Dafali and J.-P. Diguet

PU0

PU3

PU10

PU6

PU13

Router Router

PU5

Router

Router Router

PU7

Router

Mesh 2D

PU9

Router Router

Router

Router PU16

PU14

Router

PU15

Router

PU17

PU2

PU4

Router PU8

PU11

Router PU12

PU1

Fig. 3. DADM design topology method

4 From Application to NoC Design The design of NoC implies a complex set of tasks, because the NoC design flow includes several stages with different choices and complex functions. Moreover, the arrangement of these stages, their interactions, and any incompatibilities affect a large number of parameters. To address this complexity of design, we have design µSpider II CAD tool. This tool performs design space exploration and code generation. Design space exploration is implemented in an interactive way based on designer choices for arbiter, routing policies and topology selections. Then automatic procedures are available for time-consuming and error prone tasks such as Time Division Multiplexing (TDM), FIFO sizing and path allocation for guaranteed traffic management. In the continuity of µSpider II CAD tool development [1], we replaced the choice of a generic topology with an automatic generation of specific topology based on the methodology developed in section 3. The CAD environment flow design consists of four parts: 1. The first step simulates the application and extracts the communication specification such as bandwidth, latency, and communication sequences. These specifications represent the input point of the NoC design flow. They allow for an accurate description of the performances to be delivered by the NoC. To obtain the maximum benefits from this information, we developed a method for extracting constraints based on CDG. For instance, these graphs can be generated from a specification with the cadence E language. The CDG describes the the communication scheduling. This scheduling organizes communications according to their order of execution and it can automatically extract tedious information such as: (a) Parallel communications. (b) Mutual exclusion between communications. (c) Synchronisation constraints.

MPSoC Architecture-Aware Automatic NoC Topology Design

477

2. Thanks to the scheduling of communications in the CDG graphs. The second step explores these graphs to calculate the real bandwidth and latency, provides parameters to generate automatically a specific topology using the methodology explained in section 3 and determine the mutual exclusive communications. 3. The third step deals with derivation of local latency and bandwidth constraints for each unidirectional communication, and computes the minimum TDM table size required for implementing guaranteed traffic communications and a minimum bandwidth for all best effort communications. Then, it computes and allocates time slots and path to each guaranteed traffic communication. 4. The last step is the VHDL code generator, some additional C APIs are also provided for interfacing NoC components with IP cores, currently the library provides interfaces compliant with the OPB bus.

5 Experimental Results 5.1 Experiments with Complex Video Decoder Application To evaluate the potential of the topology design methodology implemented in µSpider II CAD tool for real applications, we have implemented a H.264 video decoder, provided by our industrial partner. The SoC is composed of 35 cores (28 masters and 7 slaves). The video compression standards are defined by means of profiles and levels that correspond to the configuration of the codec. A profile is usually a set of algorithms and a level corresponds to a complexity degree (e.g. pixel resolution, decoding speed, data-rate ...). H.264 standard has 14 different levels and 3 profiles: baseline (videoconferencing, wireless network ...), X (video streaming) and main (high quality and performances). Each profile can be combined with any of the different levels (e.g. resolution levels). The estimation of bandwidth by masters and slaves have been done in the case of a high definition television (HDTV) video mode (1080i/1920/30Hz) which corresponds 50%

7000

6000

65% 36%

5000

Mbits/s

4000

Average Savings : 27,80% 3000

2000 Industry Estimation CAD tool Estimation 1000

0 1

2

3

4

5

6

7

8

9

10 11

12 13

14 15

16 17

18 19

20 21

22

23 24

25 26

27 28

IPs

Fig. 4. H.264 bandwidth gains with mutual exclusion & dependency analysis

478

R. Dafali and J.-P. Diguet

to the main profile and it is the maximum size running on a simulation tool over a 8ms cycle frame 5 macro-block time. The designers of our industry partner manually estimated the bandwidth requirements to 3, 2Gbits/s but our CDG-based analysis tool showed that these features were over-estimated. While taking into account mutualexclusion and dependency constraints we finally saved around 20% (max: 65%) of bandwidth as shown exhaustively for all IP in Fig.4. The H.264 video decoder application is distributed over multiple processors via data partitioning. The multiprocessor architecture platform contains 28 processors and 7 shared memories. All the processors can read and write data from/to all shared memories. Therefore, this multiprocessor architecture belongs to the SASM class as specified in our methodology explained in section 3, because all the processors share the physical memory uniformly. 5.2 Test Conditions The experimental NoC generated by our tool has been parameterized as follows: bit width 32; Routing: street sign; End to end flow control; Round robin arbiter; and single guaranteed traffic (GT) channel with a 32 slot TDMA table. We generate four NoCs for the H.264 video decoder application with the same parameters and bandwidth and latency constraints in order to obtain the same performances. We have generated 4 NoCs with our tool. The first NoC is based on a specific topology automatically generated. And 3 other NoCs are based on the most popular NoC regular topologies, which are also used in tools addressing topology selection: – A Mesh 2D topology connects the nodes in mesh. It is so far the most popular and exploited architecture in network on chip design. The main reason is that the routing algorithms are simple to introduce, it is scalable particularly in view of its location on silicon, and provides in theory high-speed communication, but is expensive to implement on silicon because of its large number of port for routing and connection. – The WK-recursive topology [2] is a structure for recursively scalable networks. It offers many interesting properties such as a high degree of scalability, and symmetry, which correspond very well to a modular design of distributed systems connecting a large number of processors. – The Spidergon topology [6] developed by STMicroelectronics is a network topology in which all nodes are arranged in ring form and each node is connected to its clockwise and its counter-clockwise neighbour. Also, each node is connected directly to its diagonal counterpart in the network. This structure allows for a simple routing scheme to minimize the number of nodes that a data packet must cross before reaching its destination. The main advantage of this topology over a conventional 2D mesh is that the network latency can be significantly reduced for the same number of nodes. The number of links required and the silicon implementation cost of the Spidergon topology are also better than the 2D mesh.

MPSoC Architecture-Aware Automatic NoC Topology Design

479

5.3 Evaluation Experiments We compared the synthesis results obtained by the NoC based on the custom topology and the ones based on generic topologies for the same video encoder application and the same performances. On average, the area required by the generic topologies is 30% more than the area occupied by the custom topology. The gain we have obtained for the NoCs (Table1) in the worst case (Spidergon) are equal to 17%, 22%, 14% for logic, registers and memory respectively and 46%, 37%, 25% in the best case (WKRecursive). Finally, Table 3 shows that the gains obtained for the links and routers are more significant (37%, 55%, 53% for Spidergon) than the ones obtained for the full NoC. This gap is due to the area occupied by the network interfaces (NIs) (Table 2), which do not vary as much from one scheme to another. The NI area depends on the size of buffers, which depends on the bandwidth and latency constraints and not on the kind of topology used. Table 1. NoC costs of various topologies Synthesis results (Gain) Logic (LC)

Registers (LC) Memory (bits)

Dedicated 50512 47808 58442 Mesh 2D 69880 (27,7%) 64390 (25,8%) 70322 (16,9%) WK-Recursive 94774 (46,7%) 76136 (37,2%) 78242 (25,3%) Spidergon 61056 (17,3%) 61663 (22,5%) 68474 (14,7%)

Table 2. Network Interface costs of various topologies Synthesis results (Gain) Logic (LC)

Registers (LC) Memory (bits)

Dedicated 33412 36578 49862 Mesh 2D 33640 (0,67%) 36599 (0,06%) 49862 (0%) WK-Recursive 33751 (1%) 36801 (0,6%) 49862 (0%) Spidergon 33688 (0,82%) 36583 (0,02%) 49862 (0%)

Table 3. Router and Link costs of various topologies Synthesis results (Gain) Logic (LC)

Registers (LC) Memory (bits)

Dedicated 17100 11230 8580 Mesh 2D 36240 (52,8%) 27791 (59,6%) 20460 (58,1%) WK-Recursive 61023 (72%) 39335 (71,5%) 28380 (69,8%) Spidergon 27368 (37,5%) 25080 (55,2%) 18612 (53,9%)

480

R. Dafali and J.-P. Diguet

6 Conclusion and Future Work In this paper, we have developed a new topology-based design methodology for NoC applications. The proposed methodology automatically generates a custom topology according to Raina multiprocessor architecture classification. This methodology has been integrated in our CAD tool to replace the generic topology with the automatic generation of a specific topology. Using our tool to design a specific topology for a H264 video decoder application has proven the efficiency of our approach. The results from custom topology compared to generic topologies show a 30% improvement in the average area. We have noticed that TDMA-based network interface area is close to half of the total cost independent of NoC topology, and this is the point that has motivated our current research direction dedicated to interface optimization. We are adding a last optimization step to the topology design flow, this work focuses on local optimizations to remove resources (link, router) from the dedicated topology, that have a low utilization rate. We also plan to test our method for applications that belong to the DADM class.

References [1] Evain, S., Dafali, R., Diguet, J.P., Eustache, Y., Juin, E.: µSpider cad tool: Case Study of NoC IP Generation for FPGA. In: DASIP (2007) [2] Della Vecchia, G., Sanges, C.: A recursively scalable network vlsi implementation. In: Future Gener. Comput. Syst., pp. 235–243. Elsevier, Amsterdam (1988) [3] Ahonen, T., Sigenza-Tortosa, D.A., Bin, H., Nurmi, J.: Topology optimization for application-specific networks-on-chip. In: Int. work. on System level Interconnect Prediction (SLIP 2004), pp. 53–60 (2004) [4] Flynn, M.J.: Some computer organisations and their effectiveness. IEEE Trans. on Computers (1972) [5] Raina, S.: Virtual Shared Memory: A Survey of Techniques and Systems (1992) [6] Coppola, M., Locatelli, R., Maruccia, G., Pieralisi, L., Scandurra, A.: Spidergon a novel on chip communication network. In: Int. Symp. on System on Chip, pp. 15–22 (2004) [7] Elmiligi, H., Morgan, A.A., El-Kharashi, M.W., Gebali, F.: A Topology-based Design Methodology for Networks-on-Chip Applications. In: Int. Design and Test Workshop, pp. 61–65 (2007) [8] Hu, J., Marculescu, R.: Energy- and performance-aware mapping for regular NoC architectures. IEEE Transaction on CAD of Integrated Circuits and Systems (2005) [9] Kreutz, M.E., Carro, L., Zeferino, C.A., Susin, A.A.: Communication architectures for system-on-chip. In: 14th Symp. on Integrated Circuits and Systems Design, pp. 14–19 (2001) [10] Murali, S., De Micheli, G.: Bandwidth-constrained mapping of cores onto NoC architectures. In: DATE (2004) [11] Murali, S., De Micheli, G.: SUNMAP: a tool for automatic topology selection and generation for NoCs. In: DAC, pp. 914–919 (2004) [12] Ogras, U.Y., Marculescu, R.: Energy- and performance-driven NoC communication architecture synthesis using a decomposition approach. In: DATE, pp. 352–357 (2005)

ERA: An Eﬃcient Routing Algorithm for Power, Throughput and Latency in Network-on-Chips Varsha Sharma, Rekha Agarwal Manoj S. Gaur, Vijay Laxmi, and Vineetha V. Computer Engineering Department, Malaviya National Institute of Technology, Jaipur, India {varsha.m26,rekha.agarwal9}@gmail.com, {gaurms,vlaxmi}@mnit.ac.in, [email protected]

Abstract. Network-on-Chip (NoC) is viewed as a viable substitution for traditional interconnection networks to achieve high performance, communication eﬃciency and reliability in complex VLSI architectures at deep sub micron. Achieving high performance, power eﬃciency with optimum area is a target for any routing algorithm in NoC. In this paper, we propose a novel routing scheme named ‘ERA’, which oﬀers higher throughput with controlled delays while remaining power aware. ERA is an adaptive routing algorithm, which avoids congestion and tends to minimize the hot spots in the network. Unlike other existing algorithms, the proposed algorithm does not require any virtual channels to avoid deadlocks. We compare our algorithm with XY and OE on the basis of a performance metric called ‘power performance factor’ for diﬀerent trafﬁc patterns and injection models. Our results show that ERA performs better than these two algorithms. Keywords: Network-on-Chip, Energy Model, Deterministic Routing, Adaptive Routing, Deadlock, Turn Model, Latency, Throughput.

1

Introduction

Regular tile-based NoC architecture has recently been proposed as a solution to interconnect design for complex on-chip designs [1]. This architecture consists of a number of IP cores (DSP, storage or a processing element) and switches/routers. Various IP cores communicate with each other by means of routers. Routing is essentially how data ﬂows from a source to the destination. It can be deterministic or adaptive. Deterministic routing algorithms allow lower delays due to their simplistic logic when network is not congested [2]. Adaptive algorithms have the ability to sense the network congestion and redirect the packets along the other routes. Adaptivity may lead to uniform traﬃc and power distribution (minimizing hot spots) in the network and higher throughput. In this paper, we propose a novel adaptive routing scheme which senses power dissipation on the neighbouring nodes. In this way, it optimizes power and also avoids the congestion occurring in the network. Latency and path length (Manhattan distance of the destination from a node) is taken into account when equal C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 481–490, 2010. c IFIP International Federation for Information Processing 2010

482

V. Sharma et al.

power dissipation is anticipated in two or more directions. OE turn model [3] is used for routing as it avoids deadlock without the need of any virtual channel (VC). Deadlock can also be avoided by increasing the number of VCs but employing more VCs require adding buﬀer space and additional control logic to routers; communication performance of the network and reliability of the routers may be aﬀected [3,4]. OE (Odd Even) turn model approach [3] is preferred for deadlock avoidance instead of the turn model [4] because of its uniformly distributed nature. This paper is organised as follows. Section 2 presents an overview of the NoC simulators – NIRGAM integrated with ORION used for evaluating the performance statistics and power calculation respectively. Section 3 brieﬂy summarizes the work done in the development of various routing strategies for NoC. Detailed description of the proposed scheme is discussed in Section 4. Section 5 introduces a ﬁgure of merit termed power performance factor for multiobjective optimization of power, latency and throughput. Comparative study of the other algorithms with our proposed function is given in Section 6 with detailed results. We ﬁnally summarize our work in Section 7. 0

5

0, 0

1, 0

10

2, 0

15

3, 0

20

4, 0

1

6

11

16

21

0, 1

2

1, 1

7

12 2, 1

3, 1

4, 1

17

22

0, 2

3

1, 2

8

2, 2 13

3, 2

4, 2

18

23

0, 3

4

0, 4

1, 3

9

1, 4

2, 3

14

2, 4

3, 3

4, 3

19

24

3, 4

4, 4

Fig. 1. 2D mesh topology

2

NoC Simulation Framework

NIRGAM (NoC Interconnect Routing and Application Modelling) [5] is a modular, discrete event and cycle accurate simulator developed in SystemC. In NIRGAM, a 2D interconnect of tiles (Fig. 1) can be simulated with various options of virtual channels, clock frequency, buﬀer parameters, routing mechanisms and applications, etc. Each NIRGAM tile consists of various components like input channel, controller, virtual channel allocator, output channel controller and an IPcore. Each IPcore is attached to a router/switch by means of a bidirectional core channel. Energy dissipated at these components are calculated using ORION model [6,7]. ORION [6,7] is a power simulator used to evaluate power performance tradeoﬀs in respect of NIRGAM design parameters. ORION can compute static as well as dynamic energy of a component. Power computation is done by passing requisite parameters such as number of buﬀer reads and writes, crossbar traversals and arbitrations at each clock intervals through ORIONs functions. Total energy (ET ) dissipated across a router in terms of power consumed by its components is expressed as:

ERA: An Eﬃcient Routing Algorithm

ET = Ebuﬀer + Earbiter + Ecrossbar + Eclk + Elink

483

(1)

In this manner power dissipated at each router is calculated at each clock cycle using the ORION power models. At each clock cycle, these updated power values are transmitted to the neighbouring routers. Each router has a process sensitive to the power updation event and maintains an array Power[ ] which holds the value of power dissipated at the neighbouring routers. Values stored in Power[ ] is used in the proposed algorithm for uniform power dissipation and minimizing hotspots.

3

Related Work

A number of deterministic and adaptive routing algorithms have been proposed for NoC [8,9,2,10,11]. Problems like deadlock, livelock and starvation are major issues for a routing function. Routing algorithms can easily avoid livelock by being minimal. Deadlock can be avoided by removing any cyclic dependency between channels. Turn Model [4] prohibits smallest number of turns for preventing cycles to ensure no deadlocks [12]. The degree of adaptiveness oﬀered by turn model is highly uneven. An improvement is OE turn model [3] which provides more even routing adaptiveness. Algorithms like XY, OE, etc are deadlock free without requiring multiple VCs whereas algorithms like PROM [8], MAXY [9], etc. use diﬀerent VC allocation schemes for achieving deadlock free scenario. So far, most of the published works either focus on improving performance parameters (throughput, latency) or on minimizing energy needs. To the best of our knowledge, there does not exist any NoC routing algorithm which takes into account both power and performance factors simultaneously to develop an overall eﬃcient routing scheme. ERA is such an eﬃcient routing algorithm which not only guarantees high performance but also helps in the design of low power NoCs by minimizing the power consumption and appearance of hot spots.

4

ERA: An Eﬃcient Routing Algorithm

We propose an adaptive routing algorithm which takes into account power dissipation, congestion as well as the delays encountered in the network. At every node, the set of available directions for the next hop are determined as per OE turn model. For optimizing power dissipation, only those directions are chosen for which the power dissipation is less than the average power computed over all available directions. If the set contains multiple directions, algorithm tends to minimize number of hops by selecting those directions for which product of power and path length (manhattan distance of the next hop node from the destination node) has lower value. Finally for minimizing congestion, we check the availability of buﬀers in the ﬁnal two directions and choose the one with more number of available buﬀers. Proposed method is illustrated in Algorithm 1.

484

V. Sharma et al.

Algorithm 1. ERA: An Eﬃcient Routing Algorithm Require: Sx , Sy : x and y coordinate of source node Dx , Dy : x and y coordinate of destination node Cx , Cy : x and y coordinate of the current node Power[4]: set containing power at neighbours Ensure: Routing path from current node. set-avail-ﬁnal = ∅ p-factor = ∅ set-avail = available directions given by OE turn model path-len[N] = |Dx − (Cx − 1)| + |Dy − Cy | path-len[S] = |Dx − (Cx + 1)| + |Dy − Cy | path-len[E] = |Dx − Cx | + |Dy − (Cy + 1)| path-len[W] = |Dx − Cx | + |Dy − (Cy − 1)| if (Cx = Dx ) AND (Cy = Dy ) then return ip-core end if for i ∈ { N, S, E, W } do if set-avail[i] then total-power += Power[i] p-factor[i] = path-len[i] * Power[i] count-avail-dir++ end if end for average-power = total-power/count-avail-dir for i ∈ { N, S, E, W } do if set-avail[i] then if Power[i] < average-power then set-avail-ﬁnal = set-avail-ﬁnal ∪{i} end if end if end for if set-avail-ﬁnal has one direction then return direction else direction1 = set-avail-ﬁnal with min p-factor direction2 = set-avail-ﬁnal with 2nd min p-factor if buﬀer[direction1] > buﬀer[direction2] then return direction1 else return direction2 end if end if

5

Power-Performance Factor (φf )

For a power aware yet most eﬃcient routing scheme, low power consumption, smaller delays and high values of throughput are desirable. To take all these

ERA: An Eﬃcient Routing Algorithm

485

factors into account, we deﬁne a ﬁgure of merit, ‘power-performance factor’ derived from average value of performance metrics – power, latency, throughput. φf =

(average power consumed ∗ average latency) average throughput

(2)

A low value of φf indicates low average power consumption and latency and/or high average throughput. Lower the value of φf , better is performance of algorithm in terms of performance metrics.

6

Experimental Setup and Results

NIRGAM [5] integrated with ORION [6,7] is used for performing simulations to evaluate eﬀectiveness of our algorithm. All the experiments are performed on a 5x5 mesh topology. Flit size is 5 bytes, 1 for the head and 4 for the data payload. Only single VC is used with 32 buﬀers. Simulations are run for 5000 clock cycles– ﬁrst 3000 cycles for traﬃc generation. For power calculations 110nm technology and Clock Frequency of 1 GHz is assumed. 6.1

Transpose and Bit Shuﬄe Traﬃc Scenarios

Transpose and Bit Shuﬄe are synthetic traﬃc patterns commonly employed for real world applications [13]. Under ‘Transpose’ traﬃc scenario, source and destination addresses are represented as a 2-tuple r, c, r is the row and c is the column index. For source at i, j, the destination will be j, i. In ‘bit shuﬄe’, address of a node is represented by n ∗ r + c for a mesh of size nxn. Also for a given source, the destination node is obtained by a circular left shift of the source address. Fig. 1 show both 1-tuple and 2-tuple addresses for nodes in a 5x5 mesh. Fig. 2(d) and Fig. 3(d) indicate that the φf is lesser for ERA as compared to both XY and OE. ERA exhibits higher throughput and controlled latency because of its adaptive nature and in-built capability of avoiding congestion through checking the availability of buﬀers and choosing path of lower power consumption. Under transpose traﬃc pattern, ERA performs better than OE in all respects i.e. higher throughput, lower latency and lower power consumption. In case of Bit Shuﬄe, though OE oﬀers higher throughput at higher values of load but ERA performs better in case of power and latency factors resulting in a better power-performance factor in all cases. XY is a simple, minimal and deterministic algorithm with low latency. It does not respond to the congestion taking place in the network and a large part of network bandwidth is left unexploited. This results in lower throughput values for XY than ERA in both transpose and bit shuﬄe scenarios. Slightly higher value of latency and power in case ERA than XY is compensated by achieving higher throughput and ERA ultimately succeeds in maintaining lower value for the power-performance factor. As a result, it can be observed that its performance level is better than the other schemes like XY and OE.

486

V. Sharma et al. 70

0.3 XY OE ERA

60

0.25

50

0.2

Power (Watts)

Latency (Clock Cycles/flit)

XY OE ERA

40

0.15

30

0.1

20

0.05

10

0 0

5

10

15

20

25

30

35

40

45

50

0

5

10

15

Load (in %)

20

25

30

35

40

45

50

35

40

45

50

Load (in %)

(a) Latency

(b) Power

9

3 XY OE ERA

XY OE ERA

8 2.5

Power-performance factor (Watt*Cycles/Gbps)

7

Throughput (Gbps)

6

5

4

3

2

1.5

1

2 0.5 1

0

0 0

5

10

15

20

25

30

35

Load (in %)

(c) Throughput

40

45

50

0

5

10

15

20

25

30

Load (in %)

(d) φf

Fig. 2. (a) Latency, (b) Power, (c) Throughput and (d) φf for XY, OE and ERA under Transpose traﬃc pattern

6.2

Multimedia Traﬃc with Negative Exponential Distribution

Multimedia traﬃc exhibits the property of self similarity and resembles closely to traﬃc of the real time applications [14]. It has been reported that NoC traﬃc shows the property of self similarity [15,16]. The main conﬁgurable parameters are Hurst Parameter (degree of self-similarity), minimum and maximum size of each frame. The inter frame interval is generated using an exponential distribution. Negative Exponential Distribution (NED) function as discussed in [17] is used with multimedia traces to get a more realistic scenario. Spatial inter-core communication pattern can be approximated by NED. This characteristic is exploited

ERA: An Eﬃcient Routing Algorithm

30

487

0.3 XY OE ERA

XY OE ERA

28 0.25 26

0.2

22

Power (Watt)

Latency (Clock Cycles/flit)

24

20

18

0.15

0.1

16 0.05 14

12

0 0

5

10

15

20

25

30

35

40

45

50

0

5

10

15

Load (in %)

20

25

30

35

40

45

50

35

40

45

50

Load (in %)

(a) Latency

(b) Power

8

1.4 XY OE ERA

XY OE ERA

7

1.2

Power-performance factor (Watt*Cycle/Gbps)

6

Throughput (Gbps)

5

4

3

1

0.8

0.6

0.4

2

0.2

1

0

0 0

5

10

15

20

25

30

35

Load (in %)

(c) Throughput

40

45

50

0

5

10

15

20

25

30

Load (in %)

(d) φf

Fig. 3. (a) Latency, (b) Power, (c) Throughput and (d) φf for XY, OE and ERA under bit shuﬄe traﬃc pattern

by using exponential distribution function based on Manhattan distance between source and destination. Keeping Hurst parameter constant (0.81), we vary the range of minimum and maximum size of a frame and the average oﬀ time between them so as to simulate results for diﬀerent quality of videos [18]. Flit interval is maintained at 2 clock cycles. Comparison is done for three cases namely low, medium and high quality videos. Multimedia injection model is used with NED traﬃc pattern. Fig. 4 shows that in all the three scenarios our algorithm performs nearly similar to XY and better than OE. For high quality videos, ERA is better than XY.

488

V. Sharma et al. 130

0.56 XY OE ERA

XY OE ERA

120 0.54

110 0.52

0.5 90

Power (Watts)

Latency (clock cycles per flit)

100

80

0.48

70 0.46

60 0.44 50

0.42 40

30

0.4 LOW

MEDIUM

HIGH

LOW

Video Quality

(a) Latency

HIGH

(b) Power

15.5

4.5 XY OE ERA

XY OE ERA

15

4

14.5

3.5

Power-performance factor (Watt*cycles/Gbps)

Throughput (Gbps)

MEDIUM Video Quality

14

13.5

13

12.5

3

2.5

2

1.5

12

1 LOW

MEDIUM

HIGH

Video Quality

(c) Throughput

LOW

MEDIUM

HIGH

Video Quality

(d) φf

Fig. 4. (a) Latency, (b) Power, (c) Throughput and (d) φf for XY, OE and ERA under NED traﬃc pattern

6.3

Hot Spot Traﬃc Scenario

Under hot spot traﬃc, hot spots are created at nodes where heavy traﬃc is felt. The percentage deﬁned with respect to a particular hot-spot node determines the amount of traﬃc passing through it. In our simulation framework, node 18 is chosen as a hot spot with 30%. The routing algorithm chooses those paths in which less average power is dissipated from the set of all available directions. In this eﬀort, it signiﬁcantly helps in the removal of hot spots through uniform power distribution across network. Table 1 depicts the results obtained. We can observe that the percentage of total power dissipated at the hot spot node is more in case of other routing algorithms

ERA: An Eﬃcient Routing Algorithm

489

Table 1. Power dissipation at the hot-spot and the power performance factor XY

OE

ERA

% of total power dissipated 7.156 7.342 5.78 at hot spot node Power performance factor 0.5803 0.5622 0.3670

as compared to ERA. Moreover there is no sensing of power dissipation at the neighbouring nodes by the other algorithms. It can be argued that ERA not only adapts to the congestion in the network but also remains power aware.

7

Conclusion

We have presented a novel routing scheme which can consider the performance and power simultaneously. The proposed scheme is adaptive which controls the delay by taking into account the path length and oﬀers high throughput as well by avoiding the congested paths. The proposed method remains power aware as it prioritises the path with low power dissipation and signiﬁcantly helps in the minimisation of hot spot occurrences. The proposed method outperforms the deterministic algorithms like XY even when the traﬃc load is high in mesh topology. However, at very heavy loads, XY performs slightly better because of its simplicity and minimality. As can be seen from the results, ERA outperforms OE routing algorithm in terms of latency, throughput and power under all traﬃc scenarios. ERA requires all the routers to have the information about the power consumption on all its neighbouring routers. The hardware implementation of the algorithm and power calculation for the route logic circuitary can be looked upon as a part of future work. Also detailed studies for comparison of ERA with respect to other existing algorithms will be performed in future.

References 1. Dally, W.J., Towles, B.: Route packets, not wires: on-chip interconnection networks. In: Annual Design Automation Conference, Las Vegas, Nevada, United States, pp. 684–689. ACM, New York (2001) 2. Hu, J., Marculescu, R.: Dyad smart routing for network-on-chip. In: 41st Design Automation Conference (2004) 3. Chiu, G.-M.: The odd-even turn model for adaptive routing. IEEE Transactions on Parallel and Distributed Systems, 729–738 (2000) 4. Glass, C., Ni, L.: The turn model for adaptive routing. In: 19th Annual International Symposium, Computer Architecture, pp. 278–287 (1992) 5. Jain, L., Al-Hashimi, B., Gaur, M.S., Laxmi, V., Narayanan, A.: Nirgam: A simulator for noc interconnect routing and application modelling. In: Design Automation and Test in Europe (DATE), Nice, France (2007) 6. Wang, S., Zhu, X., Peh, L., Malik, S.: Orion: A power-performance simulator for interconnection networks. In: 35th annual ACM/IEEE International Symposium on Micro architecture, Istanbul, Turkey (2002)

490

V. Sharma et al.

7. Kahng, A.B., Li, B., Peh, L.S., Samadi, K.: Orion 2.0: A fast and accurate noc power and area model for early-stage design space exploration. In: Design Automation and Test in Europe (DATE), Nice, France (2009) 8. Cho, M.H., Lis, M., Shim, K.S., Kinsy, M., Devadas, S.: Path-based, randomized, oblivious, minimal routing. In: 2nd International Workshop on Network on Chip Architectures, New York (2009) 9. Rameshan, N., Biyani, A., Gaur, M., Laxmi, V., Ahmed, M.: Qos aware minimally adaptive xy routing for noc. In: 17th International Conference on Advanced Computing and Communication (ADCOM), Bangalore, India (2009) 10. Glass, C., Ni, L.: Adaptive routing in mesh-connected networks. In: 12th international conference on Distributed Computing Systems (1992) 11. Marvasti, M.B., Daneshtalab, M., Afzali-Kusha, A., Mohammadi, S.: Pampr: Power-aware and minimum path routing algorithm for nocs. In: International Conference on Electronics, Circuits and Systems, ICECS (2008) 12. Duato, J., Yalamanchili, S., Ni, L.: Interconnection Networks-An Engineering Approach. Morgan Kaufmann, San Francisco (2003) 13. Dally, W.J., Towles, B.: Principles and Practices of Interconnection Networks. Morgan Kaufmann, San Francisco (2003) 14. Varatkar, G.V., Marculescu, R.: On-chip traﬃc modeling and synthesis for MPEG2 video applications. IEEE Transactions on Very Large Scale Integrated Systems 12, 108–119 (2004) 15. Soteriou, V., Wang, H., Peh, L.: A statistical traﬃc model for on-chip interconnection networks. In: IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 104– 116 (2006) 16. Paxson, V.: Fast, approximate synthesis of fractional gaussian noise for generating self-similar network traﬃc. Computer Communications Review 27(5), 5–18 (1997) 17. Rahmani, A.M., Afzali-Kusha, A., Pedram, M.: A novel synthetic traﬃc pattern for power/performance analysis of network-on-chips using negative exponential distribution. Journal of Low Power Electronics 5, 1–10 (2009) 18. http://www.tkn.tu-berlin.de/research/trace/trace.html

Author Index

Agarwal, Rekha 481 Ao, Wenbing 299 Bader, Kai Christian Ben, Kerong 432

He, Tiening 289 Huang, Zhenchun 232

Cao, Feifei 442 Cao, Qian 314 Cao, Yuanda 190 Chen, Enhong 121 Chen, Guoliang 289 Chen, Jianhai 390 Chen, Junxiu 314 Chen, Ming 165 Chen, Yi 314 Chen, Zhipeng 452 Cheng, Boqun 412 Cheng, Buqi 3 Cutway, Anthony 277 Dafali, Rachid 470 Diguet, Jean-Philippe 470 Dong, Mianxiong 358 Eißler, Tilo 232 Evans, Nathan 232 Fang, Binxing 91 Fang, Jianan 200 Fang, Wei 289 Feng, Xiaobing 329 Gao, Feng 423 Gaudiot, Jean-Luc 106, 344 Gaur, Manoj S. 481 GauthierDickey, Chris 232 Grothoﬀ, Christian 232 Grothoﬀ, Krista 232 Guo, Jichi 46 Guo, Minyi 358 Guo, Song 358 Han, Jian-Jun 344 He, Jingsha 423 He, Qinming 390

155

Jantsch, Axel 452 Jia, Yaocang 329 Jiang, Jinlei 155 Jiang, Song 121, 373 Jiang, Xiaohong 390 Jin, Depeng 462 Jin, Hai 299, 358 Jin, Jun 190 Keene, Jeﬀ Kim, Mihye

232 134

Lai, Rongfeng 220 Laxmi, Vijay 481 Lee, HyeongOk 134 Li, Bo 299 Li, Jiaxin 190 Li, Quanlong 18 Li, Shisheng 3 Li, Xiao-Feng 3 Li, Xiaoming 220 Li, Xing 390 Li, Yong 462 Li, Zhanhuai 76 Lin, Yuan 344 Liu, Dongsheng 442 Liu, Guiquan 121, 373 Liu, Hao 442 Liu, Jia 247 Liu, Leibo 177 Lu, Xiaoyi 412 Luo, Yingwei 220 Luo, Yunfeng 432 Mao, Feng 61 Meier, Harald 232 Meng, Dan 262 Mi, Lei 432 Mi, Wei 329 Nguyen, Tung 277 Nie, Jiutao 3 Nistler, Peter J. 106

492

Author Index

Pei, Yunxia 402 Peng, Juan 412 Peng, Shufen 423 Peng, Wen 18 Pu, Xing 190

Xing, Changyou 165 Xu, Xiao 91 Xu, Yun 35 Xu, Zhiwei 247 Xue, Jingling 329

Qasem, Apan Qu, Yanhuai

Yang, Guangwen 155 Yang, Li 165 Yang, Liang 220 Ye, Kejiang 390 Yi, Qing 46 Yuan, Lina 18

46 200

Rahman, Faizur 46 Ritzdorf, Craig 232 Rutherford, Matthew J. Seo, JungHyun 134 Shang, Shifeng 155 Shao, Mingzhi 35 Sharma, Varsha 481 Shen, Xipeng 61 Shi, Weisong 277 Sim, Hyun 134 Song, Jiqiang 177 Su, Li 462 Sun, Guang 462 Sun, Guangzhong 289 Tang, Eugene Y. 177 Teng, Da 35 Tu, Xuping 358 V., Vineetha

481

Wang, Huaimin 2 Wang, Ligang 3 Wang, Shicai 247 Wang, Xiaolin 220 Wang, Zhenlin 220 Wu, Tingpeng 18 Wu, Xiaodong 344 Wu, Xu 210 Wu, Yongwei 155

232

Zeng, Leijie 76 Zeng, Lieguang 462 Zha, Li 247, 412 Zhan, Jianfeng 262 Zhang, Binbin 220 Zhang, Hongli 91 Zhang, Lei 121 Zhang, Qin 299 Zhang, Shuai 200 Zhang, Weizhe 91 Zhang, Xiaodong 1 Zhang, Xuechen 121, 373 Zhang, Yu 18, 190 Zhang, Yuanyuan 462 Zhang, Yue 402 Zhang, Yunxing 314 Zhang, Zhigang 442 Zhang, Zhihong 262 Zhao, Chongchong 314 Zhao, Xiaonan 76 Zheng, Long 358 Zheng, Peng 289 Zheng, Ran 299 Zheng, Weimin 155 Zhou, Shuchang 144 Zhou, Wei 262 Zou, Xuecheng 442 Zou, Yongqiang 247

Network and Parallel Computing, IFIP, NPC 2007

Network and parallel computing: IFIP international conference, NPC 2004, Wuhan, China, October 18-20, 2004: proceedings

CONCUR 2010 - Concurrency Theory: 21th International Conference, CONCUR 2010, Paris, France, August 31-September 3, 2010, Proceedings (Lecture Notes ... Computer Science and General Issues)

Distributed Computing: 24th International Symposium, DISC 2010, Cambridge, MA, USA, September 13-15, 2010, Proceedings (Lecture Notes in Computer ... Computer Science and General Issues)

Advances in Grid and Pervasive Computing: 5th International Conference, CPC 2010, Hualien, Taiwan, May 10-13, 2010, Proceedings (Lecture Notes in ... Computer Science and General Issues)

Sequences and Their Applications - SETA 2010: 6th International Conference, Paris, France, September 13-17, 2010. Proceedings (Lecture Notes in ... Computer Science and General Issues)

Computing and Combinatorics: 16th Annual International Conference, COCOON 2010, Nha Trang, Vietnam, July 19-21, 2010 Proceedings (Lecture Notes in ... Computer Science and General Issues)

Frontiers in Algorithmics: 4th International Workshop, FAW 2010, Wuhan, China, August 11-13, 2010, Proceedings (Lecture Notes in Computer Science Theoretical Computer Science and General Issues)

Distributed Computing and Networking: 11th International Conference, ICDCN 2010, Kolkata, India, January 3-6, 2010, Proceedings (Lecture Notes in ... Computer Science and General Issues)

Advances in Swarm Intelligence: First International Conference, ICSI 2010, Beijing, China, June 12-15, 2010, Proceedings, Part II (Lecture Notes in ... Computer Science and General Issues)

Network and Parallel Computing: IFIP International Conference, NPC 2005, Beijing, China, November 30 - December 3, 2005, Proceedings

Advances in Swarm Intelligence: First International Conference, ICSI 2010, Beijing, China, June 12-15, 2010, Proceedings, Part I (Lecture Notes in ... Computer Science and General Issues)

Fun with Algorithms: 5th International Conference, FUN 2010, Ischia, Italy, June 2-4, 2010, Proceedings (Lecture Notes in Computer Science Theoretical Computer Science and General Issues)

Unconventional Computation: 9th International Conference, UC 2010, Tokyo, Japan, June 21-25, 2010, Proceedings (Lecture Notes in Computer Science Theoretical Computer Science and General Issues)

Interactive Theorem Proving: First International Conference, ITP 2010 Edinburgh, UK, July 11-14, 2010, Proceedings (Lecture Notes in Computer Science Theoretical Computer Science and General Issues)

Network and Parallel Computing - NPC 2011

Internet and Network Economics: 6th International Workshop, WINE 2010, Stanford, CA, USA, December 13-17, 2010, Proceedings (Lecture Notes in Computer ... Computer Science and General Issues)

Artificial Immune Systems: 9th International Conference, ICARIS 2010, Edinburgh, UK, July 26-29, 2010, Proceedings (Lecture Notes in Computer Science Theoretical Computer Science and General Issues)

Architecture of Computing Systems - ARCS 2010: 23rd International Conference, Hannover, Germany, February 22-25, 2010, Proceedings (Lecture Notes in ... Computer Science and General Issues)

Evolvable Systems: From Biology to Hardware: 9th International Conference, ICES 2010, York, UK, September 6-8, 2010, Proceedings (Lecture Notes in ... Computer Science and General Issues)

Graph Transformations: 5th International Conference, ICGT 2010, Twente, The Netherlands, September 27--October 2, 2010, Proceedings (Lecture Notes in ... Computer Science and General Issues)

Language and Automata Theory and Applications: 4th International Conference, LATA 2010, Trier, Germany, May 24-28, 2010, Proceedings (Lecture Notes in ... Computer Science and General Issues)

Entertainment Computing - ICEC 2010: 9th International Conference, ICEC 2010, Seoul, Korea, September 8-11, 2010. Proceedings

Parallel Problem Solving from Nature, PPSN XI: 11th International Conference, Krakov, Poland, September 11-15, 2010, Proceedings, Part II (Lecture Notes ... Computer Science and General Issues)

Algorithms and Complexity: 7th International Conference, CIAC 2010, Rome, Italy, May 26-28, 2010, Proceedings (Lecture Notes in Computer Science)

Mathematics of Program Construction: 10th International Conference, MPC 2010, Québec City, Canada, June 21-23, 2010, Proceedings (Lecture Notes in Computer ... Computer Science and General Issues)

Algorithms - ESA 2010: 18th Annual European Symposium, Liverpool, UK, September 6-8, 2010, Proceedings (Lecture Notes in Computer Science Theoretical Computer Science and General Issues)

Parallel Problem Solving from Nature, PPSN XI: 11th International Conference, Krakov, Poland, September 11-15, 2010, Proceedings, Part I (Lecture Notes ... Computer Science and General Issues)

Advances in Geometric Modeling and Processing: 6th International Conference, GMP 2010, Castro Urdiales, Spain, June 16-18, 2010, Proceedings (Lecture Notes ... Computer Science and General Issues)

Algorithms -- ESA 2010, Part II: 18th Annual European Symposium, Liverpool, UK, September 6-8, 2010, Proceedings (Lecture Notes in Computer Science Theoretical Computer Science and General Issues)