Network and Parallel Computing: IFIP International Conference, NPC 2010, Zhengzhou, China, September 13-15, 2010, Proceedings (Lecture Notes in ... Computer Science and General Issues)
Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6289
Chen Ding Zhiyuan Shao Ran Zheng (Eds.)
Network and Parallel Computing IFIP International Conference, NPC 2010 Zhengzhou, China, September 13-15, 2010 Proceedings
Volume Editors Chen Ding University of Rochester P.O. Box 270226, Rochester, NY, 14627, USA E-mail: [email protected] Zhiyuan Shao School of Computer Science and Technology Huazhong University of Science and Technology Wuhan, 430074, China E-mail: [email protected] Ran Zheng School of Computer Science and Technology Huazhong University of Science and Technology Wuhan, 430074, China E-mail: [email protected]
The IFIP International Conference on Network and Parallel Computing is an international conference aimed at providing an exciting platform and forum for researchers and developers from academia and industry to present their latest research in the field of parallel computing systems and applications. This year NPC received 89 submissions from authors in 11 countries. The papers were reviewed by a 51-member Program Committee, with 26 members from the USA, 9 from mainland China, and the rest from Canada, Hong Kong, Taiwan, Korea, Japan, the UK, and France. Each paper received three to six reviews. Based on a total of 287 reviews, the Program Co-chairs accepted papers into two categories:

Select papers: 23 papers passed the most stringent selection. Each paper is up to 15 pages in the conference proceedings. The acceptance rate for select papers is 25.8%.

Regular papers: 11 additional papers are of sufficient publishable quality. Each paper is up to 10 pages.

For the main conference, we invited three distinguished speakers:

– Xiaodong Zhang (Ohio State University, USA): Building a Domain-Knowledge Guided System Software Environment to Achieve High Performance of Multicore Processors
– Huaimin Wang (National University of Defense Technology, China): Internet-Based Virtual Computing Environment
– Jesse Fang (Intel Labs, China)

One workshop was held in conjunction with the NPC 2010 conference: the International Workshop on Network on Chip (IWNoC 2010). The workshop was chaired by Huaxi Gu and Jiang Xu. The proceedings of the workshop are included in this volume.

We would like to thank all the authors of submitted papers for their work and their interest in the conference. We would like to express our sincere appreciation to all members of the Program Committee. Of the 296 reviews we assigned, 287 (97%) were completed. It was from these reviews that we identified a set of submissions that were clear, relevant, and described high-quality work in parallel systems and applications. Through these reviews hundreds of authors received objective and often detailed feedback from a diverse group of experts. In addition, we would like to thank the General Chairs Hai Jin and Jean-Luc Gaudiot and the Steering Committee members Kemal Ebcioglu and Guang Gao for their invaluable advice and guidance, as well as Alfred Hofmann and Ursula Barth of the LNCS editorial team for their prompt and patient response to our questions and requests. The conference proceedings would not have been possible without
the support of these individuals and organizations. In closing, it is our hope that all of these efforts have helped to improve and promote parallel-computing research in China, other Asian countries, the USA and beyond.

September 2010
Chen Ding Zhiyuan Shao Ran Zheng
Organization
Executive Committee General Co-chairs
Steering Committee Chair
Jean-Luc Gaudiot (University of California-Irvine, USA) Hai Jin (Huazhong University of Science and Technology, China)
Kemal Ebcioglu (Global Supercomputing Corporation, USA)
Program Co-chairs
Chen Ding (University of Rochester, USA) Zhiyuan Shao (Huazhong University of Science and Technology, China)
Publication Chair
Ran Zheng (Huazhong University of Science and Technology, China)
Publicity Chair
Wenbin Jiang (Huazhong University of Science and Technology, China)
Workshop Chair
Chen Yu (Huazhong University of Science and Technology, China)
Finance Chair
Na Zhao (Huazhong University of Science and Technology, China)
Registration Chair
Yingshu Liu (Huazhong University of Science and Technology, China)
Web Chair
Xuejiao Xie (Huazhong University of Science and Technology, China)
Program Committee
Ishfaq Ahmad (University of Texas at Arlington, USA)
Luc Bougé (IRISA/ENS Cachan Brittany, France)
Sun Chan (Intel Corp.)
Wenguang Chen (Tsinghua University, China)
Arun Chauhan (Indiana University, USA)
Randy Chow (University of Florida, USA)
Yeh-Ching Chung (National Tsing Hua University, Taiwan)
Xiaobing Feng (Institute of Computing Technology, China)
Bjoern Franke (University of Edinburgh, UK)
Yaoqing Gao (IBM Toronto, Canada)
Hwansoo Han (Sungkyunkwan University, Korea)
Wei-Chung Hsu (National Chiao Tung University, Taiwan)
Weijia Jia (City University of Hong Kong, Hong Kong)
Song Jiang (Wayne State University, USA)
Guohua Jin (Rice University, USA)
Francis C.M. Lau (Hong Kong Polytechnic University, Hong Kong)
Kuan-Ching Li (Providence University, Taiwan)
Xiaoming Li (University of Delaware, USA)
Xiao-Feng Li (Intel Corp., China)
Zhiyuan Li (Purdue University, USA)
Shih-wei Liao (Google Corp.)
Shinming Liu (HP Corp.)
Paul Lu (University of Alberta, Canada)
Yingwei Luo (Peking University, China)
Jun Ni (University of Iowa, USA)
Dimitrios Nikolopoulos (Foundation for Research and Technology Hellas (FORTH), Greece)
Sven-Bodo Scholz (University of Hertfordshire, UK)
Xipeng Shen (The College of William and Mary, USA)
Evgenia Smirni (The College of William and Mary, USA)
Jaspal Subhlok (University of Houston, USA)
Makoto Takizawa (Seikei University, Japan)
Chunqiang Tang (IBM T.J. Watson Research Center, USA)
Xinmin Tian (Intel, USA)
Clark Verbrugge (McGill University, Canada)
Guojun Wang (Florida Atlantic University, USA)
Zhenlin Wang (Michigan Technological University, USA)
Chengyong Wu (Institute of Computing Technology, China)
Youfeng Wu (Intel Corp.)
Nong Xiao (National University of Defense Technology, China)
Chao-Tung Yang (Tunghai University, Taiwan)
Laurence T. Yang (St. Francis Xavier University, Canada)
Qing Yi (University of Texas at San Antonio, USA)
Yijun Yu (Open University, UK)
Xin Yuan (Florida State University, USA)
Chao Zhang (Intel China Research Center, China)
Weizhe Zhang (Harbin Institute of Technology, China)
Xiangyu Zhang (Purdue University, USA)
Chengliang Zhang (Microsoft Corp.)
Yuan Zhao (IBM T.J. Watson Research Center, USA)
Weiming Zheng (Tsinghua University, China)
Yutao Zhong (George Mason University, USA)
Xiaotong Zhuang (IBM T.J. Watson Research Center, USA)
Table of Contents
Keynote Speech
Building a Domain-Knowledge Guided System Software Environment to Achieve High-Performance of Multi-core Processors
Xiaodong Zhang
Internet-Based Virtual Computing Environment
Huaimin Wang
The Core Degree Based Tag Reduction on Chip Multiprocessor to Balance Energy Saving and Performance Overhead
Long Zheng, Mianxiong Dong, Hai Jin, Minyi Guo, Song Guo, and Xuping Tu
Session 6: Cloud and Grid Infrastructure
Improve Throughput of Storage Cluster Interconnected with a TCP/IP Network Using Intelligent Server Grouping
Xuechen Zhang, Guiquan Liu, and Song Jiang
Evaluate the Performance and Scalability of Image Deployment in Virtual Data Center
Kejiang Ye, Xiaohong Jiang, Qinming He, Xing Li, and Jianhai Chen
A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme
Yue Zhang and Yunxia Pei
JAMILA: A Usable Batch Job Management System to Coordinate Heterogeneous Clusters and Diverse Applications over Grid or Cloud Infrastructure
Juan Peng, Xiaoyi Lu, Boqun Cheng, and Li Zha
User-Centric Privacy Preservation in Data-Sharing Applications
Feng Gao, Jingsha He, and Shufen Peng
Software Metrics Reduction for Fault-Proneness Prediction of Software Modules
Yunfeng Luo, Kerong Ben, and Lei Mi
Building a Domain-Knowledge Guided System Software Environment to Achieve High-Performance of Multi-core Processors Xiaodong Zhang The Ohio State University, USA
Although multi-core processors have become the dominant computing units in basic system platforms from laptops to supercomputers, software development for effectively running various multi-threaded applications on multi-cores has not made much progress, and effective solutions are still limited to high-performance applications relying on existing parallel computing technology. In practice, the majority of multi-threaded applications are highly concurrent programs demanding high throughput, such as concurrent database transactions, massive and independent query requests in Web servers and search engines, and many-task scientific applications executed in a multiprogramming mode. To best utilize the increasingly rich computing and cache resources in multi-core processors (many-cores in the near future), we must address several serious and difficult challenges. First, several critical hardware resources are shared among threads, such as the last-level caches and the memory buses, yet the management of these shared resources is largely controlled by hardware. Second, the OS scheduler has little knowledge about applications' data demands and access behavior, making sub-optimal task assignment decisions. Finally, the space allocation for each thread in the shared cache is demand-based, often causing access conflicts and pollution and significantly degrading overall execution performance. We have developed a runtime environment connecting multi-threaded applications and system software in a collaborative way, where the operating system is guided by application domain knowledge, including data access locality and execution behavior, to schedule tasks and allocate shared hardware resources for each running thread. We evaluate our environment with concurrent database transactions and multi-threaded scientific computing programs, and show strong performance and throughput improvement by minimizing cache conflicts and misses in the last-level caches of multi-cores. We further develop our system as a general framework to automatically manage multi-threaded applications on multi-core processors.
Internet-Based Virtual Computing Environment Huaimin Wang National Lab. for Parallel and Distributed Processing, National University of Defense Technology, Changsha, Hunan, P.R. China
Different from traditional resources, resources over the Internet have such natural characteristics as growth, autonomy and diversity, which have brought grand challenges to their efficient sharing and comprehensive utilization. Because of the essential differences between the Internet computing environment and the traditional computer environment, the resource management methods designed for traditional computers are no longer suitable for the Internet. The emergence and development of Web 2.0 and P2P technologies in recent years shows that resource management over the Internet should adapt to the natural characteristics of Internet resources, and virtualization is one of the potential means to deal with those challenges. However, virtualization in the Internet computing environment is essentially different from that in the traditional computer environment. There are three scientific issues to be explored: on-demand aggregation in the open Internet environment, autonomic collaboration among distributed autonomous resources, and the computational properties of the aggregation and collaboration. We propose the Internet-based Virtual Computing Environment (iVCE), which aims at providing a harmonious, trustworthy and transparent computing environment and services for end-users and applications by means of novel computing models and innovative mechanisms, such as on-demand resource aggregation and autonomic resource collaboration. The iVCE tries to virtualize and autonomicize Internet resources, design new models and mechanisms adapting well to the Internet computing environment, and promote the sharing and utilization of Internet resources. Much progress has been made on the model, technical framework and mechanisms of iVCE in the past several years. The concept model of iVCE includes the autonomic element (AE), the virtual commonwealth (VC) and the virtual executor (VE). The architecture of iVCE is designed based on these models. Many mechanisms have been proposed in such areas as resource virtualization, on-demand aggregation, autonomic collaboration, trustworthiness enhancement and programming language facilities. The iVCE middleware has been designed and implemented, providing a set of services to support the flexible sharing and collaboration of Internet resources. The iVCE middleware also provides tools such as virtual network storage, virtual network memory and virtual clusters for users. Many applications have been run on the iVCE middleware. iVCE will support the multi-scale sharing and collaboration of resources at various granularities, including large-scale data centers, client resources and various resources on the edge of the Internet. The productivity and trustworthiness of iVCE will be the focus of future investigation. The integration of multiple virtual commonwealths is also part of our future work.
Vectorization for Java Jiutao Nie, Buqi Cheng, Shisheng Li, Ligang Wang, and Xiao-Feng Li China Runtime Technologies Lab, Intel China Research Center {jiu-tao.nie,bu.qi.cheng,shisheng.li, ligang.wang,xiao.feng.li}@intel.com
Abstract. Java is one of the most popular programming languages in today's software development, but the adoption of Java in some areas like high performance computing, gaming, and media processing is not as universal as in general-purpose computing. A major drawback preventing it from being extensively adopted in those areas is its lower performance than the traditional or domain-specific languages. This paper describes two approaches to improve Java's usability in those areas by introducing vector processing capability to Java. The first approach is to provide a Java vectorization interface (JVI) that developers can program with to explicitly expose the programs' data parallelism. The other approach is to use automatic vectorization to generate vector instructions for Java programs; it does not require programmers to modify the original source code. We evaluate the two vectorization approaches with the SPECjvm2008 benchmark. The performance of scimark.fft and scimark.lu is improved by up to 55% and 107%, respectively, when running in a single thread. We also investigate some factors that impact the vectorization effects, including the memory bus bandwidth and the superscalar micro-architecture. Keywords: Java, vectorization, dependence graph, memory bus bandwidth, superscalar micro-architecture.
inserting inline assembly, calling vector intrinsics, writing programs with vector APIs, etc. However, none of those approaches is available to Java programmers. To bring the benefit of vector computation capability into the Java world, we develop two complementary approaches: a library-based programming approach and a compilation-based automatic approach. In this paper, we describe the two approaches and discuss the issues we meet with Java vectorization. The main contributions of this paper include:

1. We define a generic set of Java vectorization interface (JVI) with a Java class library and implement JVI support in a Java virtual machine. JVI covers the vector operation spectrum of common IA32 and EM64T microprocessors.
2. We implement automatic vectorization in a Java virtual machine that tries to vectorize Java applications automatically.
3. We evaluate the two vectorization approaches with the SPECjvm2008 benchmark; the performance of scimark.fft and scimark.lu is improved by up to 55% and 107%, respectively, when running in a single thread.
4. We investigate the factors that impact the vectorization effect. Our study shows that memory bus bandwidth can be a limiting factor for vectorization to scale up on multicore platforms. Superscalar micro-architecture can also hide the benefit of vectorization.
The rest of the paper is organized as follows. Section 2 discusses related work in program vectorization. Section 3 and Section 4 are the main body of this paper describing our Java vectorization work based on JVI and on automatic vectorization respectively. Section 5 gives the experimental results and discusses the issues. We conclude our work in Section 6.
2 Related Work

The latest published vectorization work for Java is an SLP [2] automatic vectorizer implemented in Jikes RVM [3]. It uses a modified tree-pattern matching algorithm to identify similar successive instructions and turn them into equivalent vector instructions. It relies on loop unrolling to vectorize computations of different iterations, as other pure SLP algorithms do. The implementation is in the initial stage. It does not generate real single instruction multiple data (SIMD) instructions; instead, it uses 32-bit integer instructions to simulate the simple vector computations of 16-bit short and 8-bit byte types. In recent years, much work on automatic vectorization has been devoted to the GCC compiler [4, 5]. Java users can benefit from this work by compiling Java programs into native code with the GNU Compiler for Java (GCJ), which uses GCC's middle and back end to compile both Java source code and Java bytecode into native code. The vectorizer implemented in GCC supports simple loop-based automatic vectorization [6] and interleaved memory accesses in loops [7, 8]. A limitation of the current GCC vectorizer is that the memory accessing strides must be constants whose values are powers of 2. Our automatic vectorization algorithm does not have this limitation. It unrolls the part of the loop that cannot be vectorized. If the unrolled instructions access consecutive addresses, a complementing SLP vectorizer can
further group them into vector instructions. A later important improvement to GCC's vectorizer is the integration of SLP vectorization into the previous loop-based vectorizer, which results in a loop-aware SLP vectorizer [9]. The improved one is more flexible in handling various memory accessing patterns. The Intel C++ compiler (ICC) provides four levels of vectorization support: inline assembly, intrinsics, a C++ vector library and automatic vectorization. The C++ vector library provides vector classes with overloaded operators. Operations on those types are translated into vector IR by the compiler's front end. Our Java vector class library does the same thing, except that no overloaded operators are provided since Java does not support operator overloading. Novell implements a vector class library in its .NET framework, called Mono.Simd, to provide vector API support for C# programming. The APIs are mapped directly to the hardware vector operations. Due to the direct mapping, Mono.Simd might be tightly coupled with certain hardware versions.
3 Java Vectorization Interface (JVI)

Both the JVI-based vectorization and the automatic vectorization need the support of a just-in-time (JIT) compiler. We implement the two vectorization approaches in Jitrino, the optimizing JIT compiler of Apache Harmony. Jitrino has two levels of intermediate representation (IR), HIR and LIR. It supports most modern facilities for optimization, such as the static single assignment (SSA) form, the control flow graph and the loop tree. It also contains many commonly used compiler optimizations. The infrastructure of Jitrino with our vectorization work is shown in Figure 1. The flow on the right side illustrates the process from the Java source code to the native machine code. Boxes represent transformations, and ellipses represent programs in different forms. The JIT compiler, i.e., Jitrino, is encircled by the dotted box, in which the dashed boxes B, C and F are transformations modified for vectorization, and the other two dashed boxes D and E are modules added for vectorization and depended on by various transformations. The upper-left dashed box A is the JVI class library that can be used by programmers in their Java source code. In the figure, A and B are for JVI-based vectorization, and C is for automatic vectorization. D, E and F provide support for both vectorization approaches. This section introduces the design and implementation of the JVI-based vectorization.

3.1 JVI Design

JVI is designed to be an abstract vector interface independent of concrete vector instruction sets. Programs written with it can be compiled to use different vector instructions, such as SSEx and AVX of Intel processors. JVI comprises a set of classes representing vector types of the various primitive Java types, including 8-, 16-, 32- and 64-bit signed integers, as well as 32-bit and 64-bit floating point numbers. They are packaged into the namespace com.intel.jvi. Currently, the default size of a vector is 128 bits.
Fig. 1. Infrastructure of vectorization implementation in Jitrino
Each vector class exposes a set of methods acting as vector operators. These "operators" cover most hardware-supported vector operations, including initialization from scalar(s), arithmetic and logical computation, bit shifting, comparison, conditional selection, memory accessing and element shuffling. They provide enough flexibility for programmers to write efficient vectorized code. The following is a piece of the source code of a JVI class representing the vector type of double:

public class F64vec2 {
    public static F64vec2 make(double d0, double d1) { return fail (); }
    public static F64vec2 load(double[] a, int i) { return fail (); }
    public F64vec2 add(F64vec2 B) { return fail (); }
    public F64vec2 sub(F64vec2 B) { return fail (); }
    public F64vec2 mul(F64vec2 B) { return fail (); }
    ......
}
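To make the intended usage concrete, the following is a small illustrative sketch (not taken from the paper) of how a programmer might use F64vec2 to add two double arrays element-wise. It relies only on the load, add and store methods that appear in this section and in Section 3.2; the trailing scalar loop, which handles the odd element left over when the array length is not a multiple of two, is our addition for completeness.

import com.intel.jvi.F64vec2;

public class VecAddExample {
    // c[i] = a[i] + b[i], processed two doubles at a time with JVI.
    static void add(double[] a, double[] b, double[] c) {
        int i = 0;
        for (; i + 1 < a.length; i += 2) {
            F64vec2 va = F64vec2.load(a, i);   // loads a[i], a[i+1]
            F64vec2 vb = F64vec2.load(b, i);   // loads b[i], b[i+1]
            va.add(vb).store(c, i);            // stores into c[i], c[i+1]
        }
        for (; i < a.length; i++) {            // scalar tail for the remaining element
            c[i] = a[i] + b[i];
        }
    }
}

As described next, such calls are never executed as ordinary method invocations; the JIT compiler translates them into vector IR and then into SIMD instructions.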
The exposed vector methods are only used to tell the front end of the JIT compiler (part B) how to translate them into the vector IR defined in part D. They should never be actually called, so their bodies only throw an exception (with fail()) to indicate that. These methods are designed to be pure functions, i.e., they never rely on or change the state of the "this" object. This design forces programmers to treat vector types as primitive types, since internally these vector types indeed behave as primitive types rather than classes. They can reside in both memory and registers, and can be passed as values in function calls. This is important to avoid dangerous misuse of the JVI library. At the same time, it gives the compiler more freedom to optimize.

3.2 JVI Implementation

The JVI interface is defined in a class library. Programs using JVI must be translated into machine code by the JVM. We extend the HIR of Jitrino by defining vector types and operators (see part D) to represent the JVI interface internally. The existing high-level optimizations in Jitrino can be applied to the IR directly. Part B in Figure 1 is the front end of Jitrino. It translates Java bytecode into Jitrino HIR. JVI class references and method calls are correspondingly translated into vector types and vector instructions of the extended HIR in this stage. We will use the following code fragment from scimark.lu of SPECjvm2008 to illustrate the translation process. This hot loop consumes more than 90% of the execution time of the benchmark.

for (int jj = j + 1; jj < N; jj++)
    Aii[jj] -= AiiJ * Aj[jj];

In the code, Aii and Aj are two arrays of double type values, and AiiJ is a double type variable. The manually vectorized version of the loop is as follows:

1   F64vec2 v_aiij = F64vec2.make (AiiJ);
2   for (int jj = j + 1; jj < N - 1; jj += 2) {
3       F64vec2 v_t1 = F64vec2.load (Aj, jj);
4       F64vec2 v_t2 = F64vec2.load (Aii, jj);
5       v_t2.sub (v_aiij.mul (v_t1)).store (Aii, jj);
6   }

In the above code, F64vec2.make(AiiJ) creates a vector containing two double type values of AiiJ. F64vec2.load loads two consecutive double type values starting at the given index from the given array. v_t2.sub() and v_aiij.mul() return the results of vector subtraction and vector multiplication, respectively, between the "this" objects and their arguments. XXX.store(Aii, jj) stores the two elements of XXX into Aii[jj] and Aii[jj+1]. The bytecode of this code fragment contains the vector type com.intel.jvi.F64vec2 and six calls of its methods. After being translated by part B (in Figure 1), the six method calls become HIR instructions I247, I262, I270, I271, I272 and I274, where double<2> is the vector type of double, i.e., the vector IR defined for com.intel.jvi.F64vec2.
4 Automatic Vectorization

Part C in Figure 1 is the automatic vectorization pass we implement in Jitrino. Since most opportunities for data parallelism occur between iterations of loops, traditional vectorization techniques mainly focus on exploiting loop-level data parallelism. This kind of technique is referred to as loop-based vectorization. Another kind of vectorization is called SLP (Superword Level Parallelism) vectorization [9]. It identifies groups of isomorphic instructions exposing superword-level parallelism, and combines them into equivalent vector instructions. The loop-based vectorization exploits data parallelism among different executions of the same instruction, while the SLP vectorization exploits data parallelism among different instructions in straight-line code (usually in the same basic block), so the SLP vectorization can be a complement to the loop-based vectorization. With loop unrolling, loop-level data parallelism can be transformed into superword-level parallelism, so the SLP vectorization can also be used to exploit loop-level data parallelism with the help of loop unrolling. However, duplicating instructions that can be parallelized multiple times, then re-recognizing them to be isomorphic among all duplicated instructions, and finally combining them back into one vector instruction is not as efficient as the loop-based vectorization, which transforms the loop only once. To achieve the same effect as the loop-based vectorization, SLP vectorization also needs induction variable analysis and data dependence analysis (it may skip them, but that will cause missed vectorization opportunities). In fact, SLP vectorization is more complex and less efficient than loop-based vectorization for vectorizing loops. As a result, we implement the loop-based vectorization in Jitrino and treat SLP vectorization as a complement that may be implemented in the future. Our loop-based vectorization is composed of two parts: vectorization analysis and vectorization transformation. The first part analyzes and collects the necessary information for all loops, and the second part performs the transformation.

4.1 Vectorization Analysis

The vectorization analysis is applied to the leaf nodes in the loop tree. Only countable leaf loops with a single entry and a single exit are taken as vectorization candidates. The countability of a loop is determined by its exit condition: the exit condition of a countable loop must be a comparison between a loop-invariant value and an induction variable. Whether a variable is loop invariant and how a variable changes is analyzed by the scalar evolution analysis. This analysis is called on demand for a requested single variable, and is also called by the data dependence analysis for analyzing array indices. The data dependence analysis builds the data dependence graph (DDG) among all instructions of a loop. The dependences due to explicit variable references can be easily retrieved from the use-define chains incorporated in the SSA-form IR. To determine dependences due to accessing aliased array elements, we first use the simplest GCD (greatest common divisor) test to filter out most non-aliased cases quickly. Then, according to the number of index variables in the array accessing expressions, we call the ZIV (zero index variable) test or SIV (single index variable) test [13]. For the case of multiple index variables (MIV), we simply assume there is a dependence, since this case is relatively rare but the testing cost is quite high.
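For readers unfamiliar with the GCD test mentioned above, the following small Java sketch illustrates the underlying idea for two affine accesses a[c1*i + d1] and a[c2*i + d2]. It is an illustration of the textbook test, not Jitrino's actual implementation, and the names are ours.

public final class GcdDependenceTest {
    // Returns true if a dependence between a[c1*i + d1] and a[c2*i + d2] cannot
    // be ruled out; false means the accesses provably never overlap, because
    // c1*i1 - c2*i2 = d2 - d1 has an integer solution only when
    // gcd(c1, c2) divides (d2 - d1).
    public static boolean mayDepend(int c1, int d1, int c2, int d2) {
        int g = gcd(Math.abs(c1), Math.abs(c2));
        if (g == 0) {                 // both strides are zero: compare the constants
            return d1 == d2;
        }
        return (d2 - d1) % g == 0;
    }

    private static int gcd(int a, int b) {
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }
}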
Different from the algorithms described in [4, 7, 8], our vectorization algorithm tries to vectorize the loop even when there exists a cyclic dependence in the loop, or some of the instructions cannot be vectorized. This is a novel feature of our vectorization algorithm; as far as we know, other algorithms simply give up vectorization when there is a dependence circle. With a strongly connected component (SCC) analysis on the DDG, DDG nodes (instructions) can be grouped into different SCCs. Instructions of trivial SCCs (those containing only a single instruction) that can be supported by hardware vector instructions are considered as candidates for vectorization. All other instructions, including those of non-trivial SCCs and those that are not supported by hardware, are considered as candidates for loop unrolling. The candidates for loop unrolling and those for vectorization are not split into two separate loops: two separate loops would require an additional temporary array and memory accessing instructions to pass data between them, which may greatly degrade the performance. Our algorithm uses pack and extract operations to transfer data among the unrolled scalar instructions and the vectorized vector instructions through registers, which is far more efficient than through memory. The pack operation packs a set of scalar values into a vector, which is implemented by a series of SSE instructions. The extract operation extracts a specific scalar element from a vector, which is implemented by a single SSE instruction. To increase vectorization opportunities, our algorithm tries to break dependence circles through dynamic alias testing. The algorithm finds DDG edges between two array accessing instructions in non-trivial SCCs, and generates alias testing code in the pre-header of the loop for all the pairs of arrays. Then, it removes all such DDG edges. As a result, some dependence circles may be broken. If there is no alias, the control flow is directed into the vectorized code path; otherwise, it is directed into the original loop. Each instruction selected for vectorization is assigned a vectorization factor (VF). The vectorization factor of an instruction is a number denoting how many scalar operations of that instruction can be parallelized in one corresponding vector operation. For example, suppose the sizes of vector types are all 128 bits; then the VF of a 32-bit integer instruction is 4 and the VF of a 64-bit floating point instruction is 2. The vectorization factor of a loop is a number denoting how many iterations of the original loop can be executed in one iteration of the vectorized loop. Instructions in a loop may have different vectorization factors; we choose the largest one as the VF of the loop to maximize the data parallelization. An approximate cost model for evaluating the profit of vectorization is given as follows:

C_1(I) = \begin{cases} \mathrm{cost}(I) \times vf_c, & A(I) = U \\ \mathrm{cost}(I^V) \times (vf_c / vf(I)), & A(I) = V \end{cases}  (1)

C_2(I) = \sum_{o \in \mathrm{opnd}(I)} \begin{cases} 0, & A(\mathrm{def}(o)) = A(I) \\ \mathrm{cost}(\mathrm{vec\_pack}), & A(\mathrm{def}(o)) = U \wedge A(I) = V \\ \mathrm{cost}(\mathrm{vec\_extract}) \times vf(I), & A(\mathrm{def}(o)) = V \wedge A(I) = U \end{cases}  (2)

P = \sum_{I \in L} (\mathrm{cost}(I) \times vf_c) - \sum_{I \in L} (C_1(I) + C_2(I))  (3)
Notations:
- C1(I): instruction cost of I after transformation;
- C2(I): operand cost of I after transformation;
- A(I): transformation action for instruction I (U: to unroll; V: to vectorize);
- vfc: common VF of the loop;
- vf(I): VF of instruction I;
- I^V: vector version of instruction I;
- def(o): defining instruction of variable o;
- opnd(I): operand set of instruction I;
- cost(I): cost of instruction I.

Equation (1) estimates the execution cost of instruction I after the transformation. Equation (2) estimates the operand retrieval cost of I after the transformation. The vec_pack is a pseudo operator representing a set of SSE instructions for packing scalar values into a vector operand. The vec_extract is another pseudo operator representing the SSE instruction for extracting a scalar operand from a vector value. Equation (3) calculates the performance profit of the action set given by A. For instructions that cannot be vectorized, we have no choice other than to unroll them. For the other instructions, which can be vectorized, we have two choices: to vectorize them or to unroll them. Different determinations of transforming actions for these instructions may lead to different profit. We use a simple local policy to try to maximize the profit. Our algorithm first finds all instructions that can be vectorized, marking them as "to-be-vectorized" and marking all others as "to-be-unrolled". Then, it goes through all instructions marked as "to-be-vectorized" and checks for each one of them whether changing the action for the instruction to "to-be-unrolled" will bring more profit (from eliminating operand retrieval cost). If yes, the algorithm changes the action of that instruction. This greedy policy may not generate the optimal result, but it is good enough in practice considering its simplicity and efficiency.

4.2 Vectorization Transformation

After all SCCs of the DDG have been assigned actions, the transformation algorithm traverses all SCCs in the dependence order, in which depended-on SCCs appear before depending SCCs. For each of the SCCs, the algorithm emits either a vector instruction or a set of scalar instructions according to the action of that SCC. The key structure for connecting vectorized instructions and unrolled instructions is a map from variables of the original loop to arrays of variables of the transformed loop. For each original variable, the mapped-to array stores its copies in the transformed loop, corresponding to a sequence of iterations. Specifically, in our algorithm, the arrays have VF + 1 elements: the first VF elements of the array store the scalar copies of the original variable corresponding to VF iterations, and the last element of the array stores the vector copy of the original variable. Both scalar and vector copies are created on demand. The transformation algorithm is as follows. The function unroll emits unrolled scalar instructions to the loop body for the instructions in a given SCC. The function vectorize emits a vector instruction to the loop body for the single instruction of the given SCC. The function map returns the variable corresponding to
the requested iteration. The first time a variable is requested, it creates the variable and generates extract or pack instructions to initialize that variable.

transform () {
    for each SCC, say s, of the DDG in the dependence order
        switch (action of s) {
            case to_unroll:    unroll (s); break;
            case to_vectorize: vectorize (s); break;
        }
}

unroll (s) {
    for (i = 0; i < VF; i++)
        for each instruction "'x' '=' 'y' 'op' 'z'" in s {
            create a new scalar variable, say 't'_i;
            emit "'t'_i '=' map('y',i) 'op' map('z',i)";
            MAP('x')[i] = 't'_i;
        }
}

vectorize (s) {
    let "'x' = 'y' 'op' 'z'" be the single instruction of s;
    create a new vector variable, say 'v';
    emit "'v' '=' map('y',VF) 'op' map('z',VF)";
    MAP('x')[VF] = 'v';
}

map (var, i) {
    if (MAP(var)[i] == NULL) {
        if (i < VF) {
            for (j = 0; j < VF; j++) {
                create a new scalar variable, say 't'_j;
                emit "'t'_j '=' 'vec_extract' (MAP(var)[VF], j)";
                MAP(var)[j] = 't'_j;
            }
        } else { // i == VF
            create a new vector variable, say 'v';
            emit "'v' '=' 'vec_pack' (MAP(var)[0], MAP(var)[1], ..., MAP(var)[VF-1])";
            MAP(var)[i] = 'v';
        }
    }
    return MAP(var)[i];
}

For example, in the following code, the arrays and the variable C are of floating point type. Reading from a[i-1] at line #5 and writing to a[i] at line #7, plus the addition at line #6, create a dependence circle with distance 1, so these instructions cannot be vectorized. The other instructions do not belong to any dependence circle and can be vectorized. The comments in the code indicate which SCC each instruction belongs to.
1   for (int i = 1; i < N; i++) {
2       t1 = b[i];       // SCC 0
3       t2 = c[i];       // SCC 1
4       t3 = t1 * t2;    // SCC 2
5       t4 = a[i - 1];   // SCC 3
6       t5 = t4 + t3;    // SCC 3
7       a[i] = t5;       // SCC 3
8       t6 = t5 * C;     // SCC 4
9       d[i] = t6;       // SCC 5
10  }

The following is the transformed code, commented with the changes of the variable map, i.e., the MAP in the above algorithm. The SCCs are processed in the order of the SCC numbers given in the comments of the above code. Lines #2 through #4 are instructions generated for SCCs #0 through #2. Lines #5 through #20 are for SCC #3. Lines #21 through #23 are for SCC #4, and line #24 is for SCC #5.

1   for (int i = 1; i < N; i++) {
2       v1 = b[i:i+3];                   // t1 -> [0, 0, 0, 0, v1]
3       v2 = c[i:i+3];                   // t2 -> [0, 0, 0, 0, v2]
4       v3 = v1 * v2;                    // t3 -> [0, 0, 0, 0, v3]
5       s1 = a[i - 1];                   // t4 -> [s1, 0, 0, 0, 0]
6       s2 = vec_extract(v3, 0);         // t3 -> [s2, 0, 0, 0, v3]
7       s3 = s1 + s2;                    // t5 -> [s3, 0, 0, 0, 0]
8       a[i] = s3;
9       s4 = a[i];                       // t4 -> [s1, s4, 0, 0, 0]
10      s5 = vec_extract(v3, 1);         // t3 -> [s2, s5, 0, 0, v3]
11      s6 = s4 + s5;                    // t5 -> [s3, s6, 0, 0, 0]
12      a[i + 1] = s6;
13      s7 = a[i + 1];                   // t4 -> [s1, s4, s7, 0, 0]
14      s8 = vec_extract(v3, 2);         // t3 -> [s2, s5, s8, 0, v3]
15      s9 = s7 + s8;                    // t5 -> [s3, s6, s9, 0, 0]
16      a[i + 2] = s9;
17      s10 = a[i + 2];                  // t4 -> [s1, s4, s7, s10, 0]
18      s11 = vec_extract(v3, 3);        // t3 -> [s2, s5, s8, s11, v3]
19      s12 = s10 + s11;                 // t5 -> [s3, s6, s9, s12, 0]
20      a[i + 3] = s12;
21      v4 = vec_pack(s3, s6, s9, s12);  // t5 -> [s3, s6, s9, s12, v4]
22      v5 = vec_duplicate(C);           // C  -> [0, 0, 0, 0, v5]
23      v6 = v4 * v5;                    // t6 -> [0, 0, 0, 0, v6]
24      d[i:i+3] = v6;
25  }
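As a bridge between the cost model of Section 4.1 and the transformation above, the following Java sketch shows one way the profit-driven action selection could be organized. It is our own illustrative reconstruction: the Instr interface and the constant costs are placeholders rather than Jitrino data structures, and for brevity the profit comparison only looks at the flipped instruction's own costs.

import java.util.List;
import java.util.Map;

enum Action { UNROLL, VECTORIZE }

interface Instr {
    int scalarCost();          // cost(I)
    int vectorCost();          // cost(I^V)
    int vf();                  // vf(I)
    boolean vectorizable();
    List<Instr> operandDefs(); // def(o) for each operand o
}

final class GreedyActionSelection {
    static final int VEC_PACK_COST = 6;     // illustrative constants, not measured
    static final int VEC_EXTRACT_COST = 2;

    // Equation (1): instruction cost after transformation.
    static int c1(Instr i, int vfc, Action a) {
        return a == Action.UNROLL ? i.scalarCost() * vfc
                                  : i.vectorCost() * (vfc / i.vf());
    }

    // Equation (2): operand retrieval cost after transformation.
    static int c2(Instr i, Map<Instr, Action> act) {
        int sum = 0;
        for (Instr def : i.operandDefs()) {
            if (act.get(def) == act.get(i)) continue;   // same action: no glue code
            sum += (act.get(i) == Action.VECTORIZE)
                    ? VEC_PACK_COST                      // scalar defs packed into a vector
                    : VEC_EXTRACT_COST * i.vf();         // vector def extracted into scalars
        }
        return sum;
    }

    // Greedy policy: start from "vectorize everything vectorizable", then flip an
    // instruction to UNROLL whenever that lowers C1 + C2 for that instruction.
    static void select(List<Instr> loop, int vfc, Map<Instr, Action> act) {
        for (Instr i : loop)
            act.put(i, i.vectorizable() ? Action.VECTORIZE : Action.UNROLL);
        for (Instr i : loop) {
            if (act.get(i) != Action.VECTORIZE) continue;
            int asVector = c1(i, vfc, Action.VECTORIZE) + c2(i, act);
            act.put(i, Action.UNROLL);
            int asScalar = c1(i, vfc, Action.UNROLL) + c2(i, act);
            if (asScalar >= asVector) act.put(i, Action.VECTORIZE);  // no gain: revert
        }
    }
}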
5 Experiments and Discussions We use SPECjvm2008 to evaluate the impact of our vectorization on performance on two representative multi-core platforms:
1. Core i7 965 3.2GHz with 8MB L3, 4.80 GT/s QPI and 3GB DDR3 memory, representing desktop machines.
2. Dual Xeon 5560 2.8 GHz with 8MB L3, 6.4 GT/s QPI and 12GB DDR3 memory, representing server machines.
The processors of both machines are based on the Nehalem micro-architecture, with a powerful out-of-order engine, and support the 32-bit and 64-bit scalar instruction sets and up to the SSE4.2 instruction set. We used 32-bit Linux as the operating system (OS) on both platforms. The performance measurement includes the JIT compiling time, but the compiling time only occupies a very small portion of the whole running time of the benchmarks. In fact, the additional compiling time of the automatic vectorization compared to JVI only comes from the automatic vectorization pass, which only applies to a few loops satisfying many restrictions, and hence has very little impact on the benchmark scores. Therefore, JVI and automatic vectorization achieve similar performance, and we do not distinguish them in the following discussions.

5.1 Performance Improvement by Vectorization

During the execution of SPECjvm2008, in total 44 loops in 37 methods within the benchmarks or libraries are vectorized. Though many of them only have a very low coverage and hence do not significantly contribute to the performance gain, these numbers at least indicate that there are considerable vectorization opportunities in general Java programs and libraries. In the performance evaluation, due to their high coverage of the hottest loop, scimark.lu (LU) and scimark.fft (FFT), two sub-benchmarks of SPECjvm2008, are selected as the main workloads for our vectorization evaluation. Both of these benchmarks have three kinds of inputs: the small, default and large input sets, corresponding to 512KB, 8MB and 32MB of input data per thread. According to the data size requirement, we configure the Harmony VM with a maximal 600MB heap to eliminate unnecessary overhead from garbage collection. Figure 2 shows the program-level performance improvements of LU and FFT. Multi-threading data are provided to check the vectorization effect in multi-core systems. From the figure we can see that vectorization achieves obvious performance gains in almost all scenarios. With a single thread, FFT and LU get 55% and 107% performance improvements on the i7 platform, and 45% and 100% on the Xeon platform. On the other hand, we can see that as the number of threads increases, the performance gain is reduced. Especially for the default and large input data sets, the performance gain degrades much faster on the i7 platform than on the Xeon platform. Figure 3 shows the scalability of the LU benchmark. From the data, we can find that, with vectorization, the scalability becomes worse on both platforms. Based on the fact that the size of the default and large data sets of LU and FFT is larger than the whole L3 cache size, we deduce that the limit of memory bus bandwidth is the main reason that restrains the performance gain of vectorization. The higher memory bus bandwidth on the Xeon platform remedies the problem to some degree, and hence results in better scalability compared with the i7 platform.
Fig. 2. The performance improvement of FFT and LU by vectorization: (a) FFT on Core i7 965, (b) FFT on Xeon 5560, (c) LU on Core i7 965, (d) LU on Xeon 5560 (improvement vs. number of threads for the small, default and large input sets)

Fig. 3. The scalability of LU: (a) LU on Core i7 965, (b) LU on Xeon 5560 (speedup vs. number of threads for the original and vectorized code with the default and large input sets)
To verify the bandwidth effect, we use the NUMA feature of the Xeon 5560 platform. On NUMA-enabled platforms, accessing the local memory of the host processor is much faster than accessing remote memory, and cores accessing local memory do not consume the bus bandwidth of remote memory. The NUMA feature cannot be exploited with the current Harmony VM because the VM allocates and binds the whole memory heap to a single processor in the main thread, although all benchmark threads run in a single VM (single-VM mode). To illustrate the impact of the NUMA feature on bandwidth with the Harmony VM, we simulate local memory access by running one benchmark thread in one VM instance and binding the VM to one core. Multiple VM instances (multiple-VM mode) are executed at the same time to simulate multiple benchmark threads.
Fig. 4. LU performance improvements and scalabilities in different scenarios
In the experiment, we run both the original sequential code and the vectorized code in single-VM and multiple-VM mode, so four scenarios in total are executed. In both execution modes, all the benchmark threads are evenly mapped to different processors, and the large input data set is used to make sure that the effect of cache misses is negligible. Figure 4 shows the data of benchmark LU on the Xeon 5560 platform. The data for 4~8 threads are shown to illustrate the scalability problem. Figure 4(a) shows the performance improvements of benchmark LU with the aid of NUMA and vectorization. In the figure, the sequential execution in single-VM mode is used as the performance baseline. From the data we can find that NUMA can dramatically improve the performance of vectorized programs. Figure 4(b) compares the scalability of the LU benchmark running in the four scenarios. From the data we can find that the scalability problem can be largely solved with the NUMA architecture.

5.2 Limitation of the Partial Vectorization

The partial vectorization algorithm described in the last section is powerful in the sense of exploiting data parallelism. However, whether it can improve the performance of a program depends on two factors: how much inherent parallelism exists in the program, and how well the micro-architecture automatically exploits that parallelism without the vectorization. For example, most modern processors have a superscalar micro-architecture that can exploit general instruction-level parallelism. If the inherent data parallelism of a program has been sufficiently exploited by the superscalar engine of the processor, performing vectorization for that program cannot bring any more performance gain. We observe this phenomenon with the benchmark scimark.sor, whose hot spot is a loop covering more than 90% of the execution time. The pseudo-code of the loop is given below:

for (j = 1; j < N; j++)
    G[j] = A * (Gm[j] + Gp[j] + G[j-1] + G[j+1]) + B * G[j];

In this loop, all arrays and variables except j and N are of double type. With our partial vectorization algorithm, the first addition and the last multiplication at the second line are vectorized, and the others are unrolled due to the dependence circle. One double addition, one double multiplication and three double loads are saved per two iterations. However, the performance of the vectorized program does not improve compared
to the non-vectorized version. The reason is that the inherent parallelism of the loop has been sufficiently exploited by the powerful superscalar engine of Nehalem. Figure 5(a) is the data dependence graph of the loop. The least execution time of this loop is determined by the execution time of the circle of the dependence graph, regardless of how strong the parallelization ability of a processor is. Suppose double-type addition, multiplication, load and store require 3, 5, 4 and 1 processor cycles respectively; then the least execution cycle count of the loop is (3 × 3 + 5 + 4 + 1) × N = 19N. During the 19 cycles of each iteration, the Nehalem micro-architecture, with six issue ports, three pipelined execution units, one load unit and one store unit, can easily arrange the other loads and computations to be executed concurrently. Thus, the execution speed of this loop has reached its upper limit.
Fig. 5. Data dependence graph of the kernel loop of scimark.sor: (a) the original loop, whose dependence circle takes about 19 cycles, and (b), (c) two re-associated versions whose dependence circles take about 16 and 13 cycles, respectively
In this example, expression re-association can help to expose more parallelism. When floating point precision is not required to be strict, the computation order can be changed to shorten the length of the dependence circle in the loop. The dependence graphs of the following two versions of the loop are shown in Figure 5 (b) and (c).

for (j = 1; j < N; j++)
    G[j] = A * (Gm[j] + Gp[j] + G[j+1] + G[j-1]) + B * G[j];

for (j = 1; j < N; j++)
    G[j] = A * (Gm[j] + Gp[j] + G[j+1]) + B * G[j] + A * G[j-1];
With the reordering of the computation, the approximate numbers of cycles required by the dependence circles of version (b) and version (c) are 16 and 13 respectively, and their (non-vectorized) execution speeds are improved by about 30% from (a) to (b) and then 40% from (b) to (c). This means that the superscalar engine is still powerful enough to exploit the increased data parallelism even when the dependence circle is minimized, and there is no further optimization space left for the vectorization technique.
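For reference, using the per-operation latencies assumed above (3 cycles for a double addition, 5 for a multiplication, 4 for a load and 1 for a store), the dependence-circle lengths of the three versions can be broken down as follows; the exact composition of each circle is our reading of the dependence graphs rather than something spelled out in the text.

(a): load G[j-1] + 3 additions + multiplication + store = 4 + 3×3 + 5 + 1 = 19 cycles
(b): load G[j-1] + addition + multiplication + addition + store = 4 + 3 + 5 + 3 + 1 = 16 cycles
(c): load G[j-1] + multiplication + addition + store = 4 + 5 + 3 + 1 = 13 cycles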
6 Conclusion and Future Work

In this paper, we introduce our Java vectorization work, which uses two ways to exploit the data parallelism of Java applications. They can achieve similar vectorization results. Our work shows up to 55% and 107% performance gains for scimark.fft and scimark.lu of SPECjvm2008 when running in one thread. Performance gain was also observed with eight threads. In the manual vectorization, we define, design and implement a unified and generic Java vectorization interface (JVI) so that Java programmers can use the interface for Java programming without considering the specific hardware support. In the automatic vectorization, we propose a novel algorithm which provides aggressive vectorization support to maximize the vectorization benefit. We analyze the results of our work, and our investigation shows that several factors, such as memory bus bandwidth, superscalar micro-architecture and code shape, need to be considered when applying vectorization techniques. Currently we are porting this work to a JavaScript engine. It is challenging because the JavaScript language is dynamically typed, and the many branches inside the code make regular data parallelism hard to exploit. We are developing type analysis techniques to overcome the problem.
References
1. The Apache Software Foundation: Apache Harmony, http://harmony.apache.org
2. Larsen, S., Amarasinghe, S.P.: Exploiting superword level parallelism with multimedia instruction sets. In: PLDI, pp. 145–156 (2000)
3. El-Mahdy, S.E.S.A., El-Mahdy, A.: Automatic vectorization using dynamic compilation and tree pattern matching technique in Jikes RVM. In: ICOOOLPS, pp. 63–69 (2009)
4. Free Software Foundation: Auto-vectorization in GCC, http://gcc.gnu.org/projects/tree-ssa/vectorization.html
5. Free Software Foundation: GCC, http://gcc.gnu.org
6. Naishlos, D.: Autovectorization in GCC. In: GCC Summit, pp. 105–118 (2004)
7. Nuzman, D., Rosen, I., Zaks, A.: Auto-vectorization of interleaved data for SIMD. In: Schwartzbach, M.I., Ball, T. (eds.) PLDI, pp. 132–143. ACM, New York (2006)
8. Nuzman, D., Zaks, A.: Autovectorization in GCC – two years later. In: GCC Summit, pp. 145–158 (2006)
9. Rosen, I., Nuzman, D., Zaks, A.: Loop-aware SLP in GCC. In: GCC Summit, pp. 131–142 (2007)
10. Novell Corporation: Mono.Simd namespace, http://www.mono-project.com/
11. Intel Corporation: IA-32 Intel Architecture Optimization Reference Manual (1999–2003)
12. Pedro, V., et al.: Automatic loop transformations and parallelization for Java. In: ICS 2000: 14th Int. Conf. on Supercomputing, pp. 1–10 (2000)
13. Allen, R., Kennedy, K.: Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann, San Francisco (2001)
Just-in-Time Compiler Assisted Object Reclamation and Space Reuse Yu Zhang1,2 , Lina Yuan1 , Tingpeng Wu1 , Wen Peng1,2 , and Quanlong Li1,2 1
School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, P.R. China 2 Software Security Laboratory, Suzhou Institute for Advanced Study, University of Science and Technology of China, Suzhou 215123, P.R. China [email protected]
Abstract. Garbage collection consumes significant overhead to reclaim memory used by dead (i.e., unreachable) objects in applications. This paper explores techniques for compiler assisted object reclamation and allocation on an actual JVM. In our approach, the just-in-time compiler identifies dead objects using pointer and escape analysis combined with liveness information and inserts calls to free them, while the garbage collector provides runtime support for explicit reclamation and space reuse. Our approach differs from other compiler assisted GC in two crucial ways. First, it identifies not only the objects that are no longer referenced directly by the program, but also the objects that are referenced only by those identified to-be-freed objects. Second, it modifies a parallel garbage collector, and not only frees the identified dead objects, but also tries to reuse their space immediately. The experimental results show that the JIT-assisted GC improves memory utilization and performance effectively. Keywords: Compiler assisted garbage collection, pointer and escape analysis, live variable information, parallel garbage collector.
1 Introduction
Garbage collection (GC) [1] is a technology that frees programmers from the error-prone task of explicit memory management. However, it consumes significant overhead to find dead (i.e., unreachable) objects in the managed heap and to reclaim the memory used by them. Accordingly, GC has become one of the dominant factors influencing the performance of runtime systems such as the Java virtual machine (JVM). For example, SPECjbb2005 [2] usually spends 10% of its total execution time in GC. In order to reduce the cost of GC, besides improving GC algorithms [1, 3, 4], a more effective approach is compiler assisted memory management, including stack allocation [5–7], region allocation [8–11], compile-time free [12–15] and reuse [15–17].
This research was supported in part by the National Natural Science Foundation of China under Grant No. 90718026 and gifts from Intel Corporation.
Stack or region allocation reduces the load of GC by allocating some objects in a method stack frame or in a region, but all objects in a stack frame or in a region are reclaimed simultaneously even if some of them became dead earlier. Moreover, stack allocation may induce stack overflow, while region allocation needs sophisticated region management, and neither has delivered improvements on garbage collectors. Compile-time free and reuse belong to compiler assisted GC: they improve the collection or reuse of heap-allocated objects through compiler efforts. Some works insert free instructions to free dead objects [12–15], thus reducing the GC load of identifying dead objects. Others automate compile-time object merging or reuse [15–17] to decrease the number of objects allocated in the heap. We explore techniques for compiler assisted object reclamation and space reuse on an actual JVM, i.e., Apache Harmony DRLVM [18], and implement them as a system called just-in-time compiler assisted garbage collection (JIT-assisted GC). The novel contributions we made are as follows:

– We design a novel object lifetime analysis algorithm which is field-sensitive and context-sensitive. The analysis combines pointer and escape analysis with flow-sensitive liveness information to identify not only the objects that are no longer referenced directly by the program, but also the objects that are referenced only by those identified objects.
– We collect the free instrumentation information from the dead object information based on the dominance relationship in the control flow. Various strategies are used to ensure the validity and flexibility of the instrumentation.
– We modify GCv5 [19], a parallel garbage collector, not only adding a gc_free interface for explicit reclamation but also improving gc_alloc to try to reuse the explicitly reclaimed space immediately.

The JIT-assisted GC system can handle multi-threaded programs. The experimental results show that the memory utilization and the performance of the whole runtime system are improved effectively.
2 Overview of the JIT-Assisted GC
In this section we first give an overview of the framework of the JIT-assisted GC, then take a simple example to illustrate the compiler analysis and transformation for explicit object deallocation.
2.1 The Framework of the JIT-Assisted GC
The JIT-assisted GC is built on DRLVM and involves several components of DRLVM, such as VMCore, EM (Execution Manager), Jitrino.OPT (a JIT optimizing compiler), and the GCv5 garbage collector. VMCore concentrates most of the JVM control functions. EM selects a compiler or an interpreter for compiling/executing a method, and handles profiles and the dynamic recompilation logic. Jitrino.OPT features two types of code intermediate representation (IR): platform-independent high-level IR (HIR) and
platform-dependent low-level IR (LIR). Both of them are graph-based structures denoting the control flow of a program. Jitrino incorporates an extensive set of code optimizations for each IR type, and defines the compilation process as a pipeline, which is a linear sequence of steps. Each step stores a reference to an action object (e.g., an optimization pass), its parameters and other information. GCv5 is a fully parallel garbage collector including various algorithms, and can work in generational and non-generational modes.
Fig. 1. Framework of the JIT-assisted GC
Fig. 1 shows the framework of the JIT-assisted GC, which mostly concerns the shaded areas in the figure. On the JIT side, the compilation pipeline loads the bytecode of the current to-be-compiled method, first translates it into HIR via the Translator, then transforms the HIR into LIR via the Code Selector, and finally emits the native code via the Code Emitter. We currently explore techniques for explicit object reclamation and space reuse based on this framework, and they work as follows:
JIT side. To support explicit object deallocation, an Op_Free instruction and a VM_RT_FREE runtime call instruction are added to the HIR and LIR, respectively, and the Code Selector and the Code Emitter are modified to translate the extended instructions. Algorithms for the compiler analysis and transformation for explicit object deallocation are then designed and implemented as an optimization pass for the HIR.
GC side. A gc_free interface is added to support explicitly reclaiming object space, and the implementation of gc_alloc is modified to try to reuse the explicitly reclaimed space immediately.
VMCore side. Runtime support for mapping the VM_RT_FREE instruction to the gc_free interface is implemented. The mapping needs to push the right parameters onto the runtime stack, and to ensure the consistency of the stack pointer before and after the call to the gc_free interface.
Fig. 2. Code fragment from BH, one of the Jolden benchmarks. The code in italics is inserted by the compiler.
Beyond the explicit reclamation and space reuse discussed in this paper, more techniques for compiler-assisted GC can be explored on this framework. For example, for some allocation sites in loops we can have them produce objects in the first iteration and reset those objects in subsequent iterations, reducing the allocation overhead of GC.
2.2 A Simple Example
Fig. 2 shows a code fragment from BH in Jolden. The code in italics is inserted by the compiler. Lines 1-20 show a class MathVector containing a static field NDIM and an array field data. The constructor at lines 4-8 builds a double array object o1 and initializes each element of o1. Lines 23-31 show a method subdivp in class Cell which creates an object o2 of type MathVector at line 24. Note that after line 26, dr is not live, so o2, which is referenced only by dr, is dead. If the compiler maintains the field reference information across procedures, it can further check whether objects referenced by the fields of the identified dead objects are dead. In the example, object o1 is referenced only by field data of o2 in method subdivp, so once o2 is dead, the compiler can detect that o1 is also dead according to the field reference information, and can decide that the dead points (i.e., the program points where an object is dead) of o1 and o2 are the same. Although the compiler identifies that o1 and o2 are dead after line 26, there are still two problems to be considered. One is whether the allocation site of an object dominates its dead point. The other is how to get the reference to the dead object. Here the allocation site of o2 dominates the point after line 26, so its free instruction can be inserted after line 26. However, there is no reference to o1 in the original subdivp code, so the compiler has to generate instructions to obtain the reference to o1; the load of dr.data in Fig. 2 (line 27) will correspond to several instructions at the HIR level.
3 Object Lifetime Analysis and Program Transformation
In this section we first describe the compiler analysis and transformation for explicit object deallocation in a nutshell, and then describe some key parts of it.
3.1 Analysis and Transformation in a Nutshell
When executing a Java application on DRLVM, VMCore controls the class loading and interacts with the compiler to compile the bytecode of a to-be-executed method into native code, and then executes it. The compilation process of a Java method is defined as a pipeline specified in the EM configuration file, and our analysis and transformation is developed as an HIR optimization pass which can be configured into a pipeline. Due to the limitation of the pipeline management framework, the pass can only directly obtain the HIR of the current compiling method M, and has to insert free instructions into M only when M is the method being compiled by the pipeline.
01 if (M is not analyzed){
02   G = init(M);
03   L = calculateLiveInfo(M);
04   B = getRevRuntimeBasicBlocks(M);
05   foreach b in reverse iterator of B {
06     foreach instruction i in b
07       transGwithInst(i, G);
Fig. 3. Flow of the object lifetime analysis and transformation
Fig. 3 shows the flow of the pass, where method M is in HIR of static single assignment (SSA) form; L, D, and I represent the liveness information, the dead object information, and the free instrument information, and ℳ denotes the summary of M. Line 03 calculates L. Line 04 gets the reversed pseudo-runtime basic block sequence of M, which consists of all reachable basic blocks of M in reverse topological order, where the basic block exited from a loop is located before all basic blocks in the loop, and the exception handling block reached by an edge from a basic block b is located before the other blocks reached from b. Lines 05-10 identify D through the intra-procedural and inter-procedural analysis based on a program abstraction called the points-to escape graph (PEG), and collect I from D. Line 11 records the analysis result of M, and line 14 transforms the HIR of M according to the analysis result of M. In the following subsections we present the details of the PEG, the intra-procedural and the inter-procedural analysis, and the instrument information collection in turn.
3.2 Points-to Escape Graph
Definition 1. Suppose M is a method, and V, P denote the set of variables and the set of formal parameters of method M, respectively. The PEG of M is a directed graph, denoted as G = (No ⊎ Nr, Ep ⊎ Ef) (⊎ represents disjoint union), where:
– No = Nc ⊎ Np represents the set of objects accessed in M.
  • Nc represents the set of objects created by allocation sites in M.
  • Np = Nfp ∪ Nin represents the set of objects created outside M, called phantom objects, where Nfp represents the set of objects created in the direct or indirect callers of M and passed into M via the formal parameters of M and their fields, and Nin represents the set of objects created in the direct or indirect callees of M and passed into M via the return value receivers and their fields or the fields of the actual parameters at each call site in M.
  • Nret ⊆ No represents the set of objects returned from M.
– Nr is the set of reference nodes in M. Each variable with reference type in M corresponds to a reference node, i.e., Nr ⊆ V.
– Ep ⊆ Nr × No represents the set of points-to edges. ⟨v, o⟩ ∈ Ep denotes that reference node v may point to object o.
– Ef ⊆ No × F × No represents the set of field edges, where F represents the set of non-static fields in M. ⟨o1, f, o2⟩ ∈ Ef denotes that field f of object o1 may point to object o2.
Each object o in a PEG G (o ∈ No) has an associated escape state, denoted as ξ(o). The range of ξ(o) is a lattice E consisting of two elements: EN ≺ EG. EG means the object escapes globally and may be accessed by multiple threads; EN means that the object may not escape globally. If an object o does not escape globally, that is, the object can be accessed only by a single thread, and no other variables or object fields refer to object o after a program point p, then o can be reckoned as a dead object at point p; we call p the dead point.
3.3 Intra-procedural Analysis
The identification of dead objects is carried out while building and transforming the PEG of M according to each instruction in the pseudo-runtime basic block sequence and to the live variable information, i.e., lines 07-08 in Fig. 3. We first discuss the analysis process neglecting call instructions in this subsection.
Transforming the PEG According to the Basic Instructions. Given an instruction i in the HIR, the PEG at entry to i (denoted as G_(i)) and that at exit from i (denoted as G^(i)) are related by the standard data flow equations:

    G^(i) = f^i(G_(i))                                        (1)

    G_(i) = ∧ over i′ ∈ Pred(i) of G^(i′)                      (2)

where f^i denotes the data flow transfer function of instruction i, Pred(i) is the set of predecessor instructions of i, and the operator ∧ is a merge of PEGs. Table 1 shows the transfer function f^i for each kind of basic instruction i, where the Ac and Ap operations are defined in Definitions 2 and 3.
Table 1. The transfer functions for each kind of basic instruction
HIR instruction i | G^(i) = f^i(G_(i))
defineArg: fp ∈ P | o := newObject(); Nfp := Nfp ∪ {o}; Ep := Ep ∪ {⟨fp, o⟩}; ξ(o) := EG.
new: v = new C or v = new C[] | o := newObject(); Nc := Nc ∪ {o}; Ep := Ep ∪ {⟨v, o⟩}; if (o is a thread object) ξ(o) := EG else ξ(o) := EN.
copy: v1 = v2 | Ep := Ep ∪ {⟨v1, o⟩ | ⟨v2, o⟩ ∈ Ep}.
phi: v = phi(v1, v2) | Ep := Ep ∪ {⟨v, o⟩ | ⟨v1, o⟩ ∈ Ep ∨ ⟨v2, o⟩ ∈ Ep}.
putField: v1.f = v2 | suppose X = {x | ⟨v1, x⟩ ∈ Ep}, Y = {y | ⟨v2, y⟩ ∈ Ep}; Ef := Ef ∪ {⟨x, f, y⟩ | x ∈ X, y ∈ Y}; ∀x ∈ X, ∀y ∈ Y. Ac(x, y); if (ξ(y) = EG) Ap(y).
getField: v1 = v2.f | suppose X = {x | ⟨v2, x⟩ ∈ Ep}, Y = {y | ⟨x, f, y⟩ ∈ Ef, x ∈ X}; if (Y = ∅) { o := newObject(); Nin := Nin ∪ {o}; ξ(o) := EN; Ef := Ef ∪ {⟨x, f, o⟩ | x ∈ X}; Ep := Ep ∪ {⟨v1, o⟩} } else { Ep := Ep ∪ {⟨v1, y⟩ | y ∈ Y} }.
putStaticField: C.sf = v | ∀⟨v, o⟩ ∈ Ep. ξ(o) := EG; Ap(o).
getStaticField: v = C.sf | o := newObject(); Nin := Nin ∪ {o}; ξ(o) := EG; Ep := Ep ∪ {⟨v, o⟩}.
return: return v | Nret := Nret ∪ {o | ⟨v, o⟩ ∈ Ep}.
Definition 2. Given two object nodes o1, o2 ∈ No in the PEG G, the escape state combination operation Ac(o1, o2), which propagates ξ(o1) to ξ(o2), is defined as:

    if e ∈ E, e = ξ(o1) and ξ(o2) ≺ e, then ξ(o2) := e.        (3)

Definition 3. Given an object node o ∈ No in the PEG G where ξ(o) = EG, the operation Ap(o) sets the escape state of each object reachable from object o via a path of field edges to EG.
The Ac operation is used when there is a field assignment (i.e., putField or getField in Table 1) or an inter-procedural information combination, while the Ap operation is used when there is a static field assignment (i.e., putStaticField or getStaticField).
Transforming the PEG Combining with the Live Variable Information. After analyzing all instructions in a basic block, combining with the live variable information, if a variable v is not live, the outgoing points-to edges of v are clipped, so objects pointed to only by v can be regarded as dead. Furthermore, if an object o dies, the outgoing field edges of o are clipped, so objects referenced only by the fields of o can also be regarded as dead. The clip operation AD is based on the live variable information and produces the dead object information at the end of each basic block.
Definition 4. Given an object o ∈ No and a reference node v ∈ Nr, let G and G′ denote the PEGs before and after the AD operation, respectively. The AD operation is defined by the following two rules.
Rule (4): if G = (No ∪ Nr, Ep ∪ Ef), v ∈ Nr, and Ep^v = {⟨v, o⟩ | o ∈ No}, then G′ = (No ∪ Nr′, Ep′ ∪ Ef) where Nr′ = Nr − {v} and Ep′ = Ep − Ep^v.        (4)

Rule (5): if G = (No ∪ Nr, Ep ∪ Ef), o ∈ No, and Ef^o = {⟨o, f, o′⟩ | o′ ∈ No}, then G′ = (No′ ∪ Nr, Ep ∪ Ef′) where No′ = No − {o} and Ef′ = Ef − Ef^o.        (5)
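As a concrete illustration of these operations, the C sketch below models the escape lattice EN ≺ EG, the combination operation Ac, and the liveness-driven clipping of rules (4) and (5). All type and function names here are illustrative assumptions for this description; they are not DRLVM's actual data structures.

typedef enum { ES_EN = 0, ES_EG = 1 } EscapeState;   /* lattice: EN below EG */

typedef struct Obj {
    EscapeState state;          /* xi(o) */
    int         incoming;       /* number of points-to and field edges targeting o */
    int         nfields;        /* number of outgoing field edges of o */
    struct Obj *field[8];       /* targets of those field edges */
    int         dead;           /* set once o has been identified as dead */
} Obj;

/* A_c(o1, o2): propagate xi(o1) to xi(o2) when it is larger in the lattice */
static void escape_combine(const Obj *o1, Obj *o2) {
    if (o2->state < o1->state)
        o2->state = o1->state;
}

/* Rule (5): once o is dead, clip its outgoing field edges; a target that
   loses its last incoming edge and does not escape globally dies as well */
static void clip_object(Obj *o) {
    o->dead = 1;
    for (int i = 0; i < o->nfields; i++) {
        Obj *t = o->field[i];
        if (--t->incoming == 0 && t->state == ES_EN && !t->dead)
            clip_object(t);
    }
    o->nfields = 0;
}

/* Rule (4): at the end of a basic block, a reference v that is no longer
   live drops its points-to edges; objects kept alive only by v die */
static void clip_dead_reference(Obj **points_to, int n) {
    for (int i = 0; i < n; i++) {
        Obj *o = points_to[i];
        if (--o->incoming == 0 && o->state == ES_EN && !o->dead)
            clip_object(o);
    }
}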
3.4 Inter-procedural Analysis
When analyzing a method M, only objects with state EN may be explicitly freed. If such an object is referenced by a formal parameter or the return value of M, or is reachable from their fields, the object cannot be freed in M because M's callers may use it. The object lifetime analysis needs to record such objects in a summary of M, and to update the PEG of M's caller using M's summary when analyzing a call instruction that invokes M.
Definition 5. Given a method M and its PEG G = (No ∪ Nr, Ep ∪ Ef), the object lifetime analysis result of M is a 2-tuple ⟨ℳ, D⟩ where:
– ℳ = (Nfp ∪ Nret, Ef′) is a summary of M. It records all objects referenced by the formal parameters or the return value of M, i.e., Nfp ∪ Nret, and the set of field edges starting from them, i.e., Ef′ = {⟨o, f, o′⟩ | o ∈ Nfp ∪ Nret ∧ ⟨o, f, o′⟩ ∈ Ef}.
– D describes the dead object information in M. It is a set of triples, each denoted as ⟨o, r, p⟩, where
  • o ∈ No is dead after the point p in M,
  • r = ⟨v, f⟩ represents the reference to o. If f is null, then v is the reference to o; otherwise, v.f is the reference to o.
Given a method M and its PEG G, suppose there is a call instruction v = v0.m(v1, ..., vn) in M and the summary of m is ℳm = (Nfp^m ∪ Nret^m, Ef^m). The process of dealing with the call instruction is as follows:
1. Combine the formal parameters and the actual parameters. For each ⟨vi, o⟩ in Ep, perform Ac(fpi, o) where fpi ∈ Nfp^m is the corresponding formal parameter of m.
2. Combine the return value and the return value receiver. For each ⟨v, o⟩ in Ep, perform Ac(r, o) where r ∈ Nret^m is the return value of m.
3. Map field edges. For each edge in Ef^m, add a corresponding edge in Ef.
4. Propagate escape states. If the escape state of an object o referenced by one of the actual parameters or the return value receiver becomes EG, then perform Ap(o).
The above inter-procedural combination takes a callee summary as a precondition. However, a callee of M may not have been analyzed yet when analyzing M. If so, there are two optional ways to deal with the call site. One is to neglect the callee and make a conservative decision. The other is to start up a new pipeline to compile the unanalyzed callee to obtain its summary. This special pipeline only includes a few basic passes translating bytecode into non-optimized HIR of SSA
form and the object lifetime analysis pass, in order to obtain the analysis result of the callee rather than its native code. The latter way is more precise but incurs more overhead. We introduce an argument to control the depth level at which the special compilation of unanalyzed callees is started, so users can use it to trade off between precision and performance.
3.5 Collecting Free Instrument Information
The dead objects and their dead points in D cannot be directly used as the instrument information for generating free instructions. Sometimes instrumenting directly at a dead object o's dead point may cause compile-time or runtime errors; e.g., if o is created in a branch and dies outside the branch, freeing o at the dead point might induce a runtime error. Another problem is how to free an object o that dies in method M when there is no explicit reference to o in M, e.g., o1 referenced by dr.data in method subdivp. Therefore, we need to collect the instrument information from the dead object information. The structure of the instrument information I is quite similar to that of the dead object information; the only difference is that the dead point in the latter is replaced by the instrument point in the former. When collecting instrument information, a rule must be followed: given a dead object o and its dead point p, the basic block in which the allocation site of o appears must dominate the dead point p of o. Otherwise, the compiler would be trying to free an object that may not have been allocated, causing runtime errors. The dominance relationship can be obtained from the dominance tree of the HIR. Two key steps of the collecting process are as follows:
Confirm dead object reference: for an object o, there are two kinds of references: one is at its allocation site, e.g., A a = new A(), where a is the reference to the newly created object o; the other is brought by phi, assign, getField, or putField instructions. We preferentially choose the reference at the allocation site for a dead object. If the dead object has no explicit reference in the method, we can make use of another object's field to generate its reference indirectly.
Confirm instrument point: for a dead object o, we preferentially choose the basic block which contains a return instruction (denoted the return node) to insert the instructions to free o. If no exception occurs when executing the program, the return node must be executed. If the confirmed reference point of dead object o cannot dominate the return node, and the dead point of o is in a branch, then we have to insert the instructions to free o at the dead point, because such a branch may not be executed.
According to the reference and the instrument point information of each dead object provided in I, the code transformation for free instrumentation can easily create instructions to explicitly free the object. Fig. 4 shows the analysis process of method subdivp in Fig. 2. We give source-level statements for the sake of brevity. The new expression in block 2 will implicitly invoke the constructor, so object o1 created there will be passed into subdivp and become an element of Nin in the PEG of subdivp. At the end of
block 4, variable dr is not live, so the analysis determines that o2, referenced only by dr, and o1, referenced only by dr.data, are both dead. At last the analysis records the analysis result.
Fig. 4. An example illustrating the analysis process
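As a rough illustration only, the instrument-point decision described above can be read as the following C sketch; dominates() is assumed to query the HIR dominance tree, and all names are hypothetical rather than the pass's actual code.

typedef struct BasicBlock BasicBlock;
int dominates(const BasicBlock *a, const BasicBlock *b);   /* from the dominance tree */

/* Pick the block where the free instructions for a dead object should be
   inserted, or NULL if the object cannot be freed safely. */
static BasicBlock *choose_instrument_point(BasicBlock *alloc_block,
                                           BasicBlock *ref_block,
                                           BasicBlock *dead_block,
                                           BasicBlock *return_node) {
    /* the allocation site must dominate the dead point; otherwise we might
       free an object that was never allocated */
    if (!dominates(alloc_block, dead_block))
        return NULL;
    /* prefer the return node, which is executed whenever no exception
       occurs, provided the confirmed reference dominates it */
    if (dominates(ref_block, return_node))
        return return_node;
    /* otherwise the dead point lies in a branch that may not execute, so
       the free has to be placed at the dead point itself */
    return dead_block;
}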
3.6 Special Design Tradeoff
Thread Object. Thread objects are distinguished from other ordinary objects based on the class hierarchy, and their escape states are initialized to EG. If an object o is assigned to a thread object field, then ξ(o) := EG. If a thread object ot has ended its execution, ot is treated as an ordinary object and ξ(ot) is reset according to the escape states of the objects referring to ot in the current PEG. However, it is difficult to identify when and where a thread has ended execution; our analysis only recognizes this case through join() invocations on thread objects.
Loop. Instructions in loops are analyzed only once, which makes the analysis simpler and cheaper, since the analysis overhead is part of the whole program's runtime overhead. The analysis remains correct and conservative because, according to the rules in Section 3.5: 1) if the allocation site pa of an object o occurs before a loop entry and o dies in the loop, then when pa dominates a return node pr we select pr as the instrument point, and otherwise we may select some point after the loop exit; 2) if the allocation site pa of o occurs in a loop and o dies in the loop, we select a dead point in the loop that is dominated by pa as the instrument point; otherwise some reference to o is live in all basic blocks of the loop, and o is not freed.
Array. All elements of an array are abstracted as one object with a special field, and accesses to an element are treated as accesses to the special field. This may reduce the amount of explicitly freed object space, but it saves analysis overhead.
Recursion. Our inter-procedural analysis can handle recursion. It maintains a chain of method invocations when meeting an unanalyzed callee; if the current unanalyzed callee already exists in the chain (i.e., there is recursion), the cycle in the chain is cut and the inter-procedural combination is done conservatively based on the current method summaries.
4 Explicit Reclamation and Space Reuse
GCv5 [19] is a parallel GC which supports multiple collectors running collaboratively. We choose GCv5-MS to implement explicit reclamation and space reuse (the result is denoted JIT-GCv5-MS), because it uses free lists to organize the heap space, which makes it convenient to add a free space to, or acquire one from, the heap.
4.1 Brief Overview of GCv5-MS
Each thread in an application (called an application thread) corresponds to a mutator thread in GCv5-MS. Each mutator takes charge of the allocation for the corresponding application thread. GCv5-MS classifies objects into two kinds, i.e., small objects (less than 1KB) and large objects (greater than or equal to 1KB), and provides the Free Block Pool (FBP) and the Free Area Pool (FAP), shown in Fig. 5, for the allocation of the two kinds, respectively. Each pool is organized as an array of segregated free lists, where each free list contains blocks/areas of the same size or size class. The FBP has 254 segregated free block lists shared among all mutators, and the blocks in the same list provide objects of the same size (from 8B to 1020B, aligned to 4B). Each block comprises a header and a data area. The header describes the data area, e.g., with a bitmap marking the status of each slot in the data area, such as in use or free. Each mutator requests a free block from the pool, and its acquired blocks are local to the mutator. When a mutator receives a request to allocate a small object, it searches its local block of the requested size. If there is a free slot of the requested size the mutator can return one; otherwise it needs to request a free block of the requested size from the pool. Operations on the pool must be synchronized, while operations on the mutator-local blocks need not be. The FAP has 128 segregated free area lists. The last list contains free areas of size greater than or equal to 128KB. All mutators share the pool and must request memory for large objects with synchronization. Generally speaking, there are relatively few large objects in applications, so the synchronization overhead of parallel large object allocations is not high.
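As a rough illustration of the mutator-local small-object path just described, the following C sketch shows a block with a per-slot mark bitmap; the field names and constants are assumptions made for this example, not GCv5-MS's actual layout.

#include <stddef.h>
#include <stdint.h>

#define BLOCK_SLOTS 512                       /* slots per block (illustrative) */

typedef struct Block {
    uint32_t slot_size;                       /* 8B .. 1020B, 4B aligned */
    uint32_t mark[BLOCK_SLOTS / 32];          /* 1 bit per slot: in use or free */
    struct Block *next;                       /* next block in its free list */
    char     data[];                          /* the slots themselves */
} Block;

/* mutator-local allocation: scan the bitmap for a free slot; no locking is
   needed because the block belongs to exactly one mutator */
static void *block_alloc(Block *b) {
    for (unsigned w = 0; w < BLOCK_SLOTS / 32; w++) {
        if (b->mark[w] != 0xFFFFFFFFu) {
            unsigned bit = 0;
            while (b->mark[w] & (1u << bit)) bit++;
            b->mark[w] |= (1u << bit);
            return b->data + (size_t)(w * 32 + bit) * b->slot_size;
        }
    }
    return NULL;                              /* block full: request a new one */
}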
4.2 Allocation and Explicit Reclamation in JIT-GCv5-MS
In order to support explicit reclamation and space reuse, we modify GCv5-MS into JIT-GCv5-MS, adding gc_free and modifying the implementation of gc_alloc.
Fig. 5. Heap space management of GCv5-MS
Fig. 6. Heap space management of the explicitly reclaimed objects
Each mutator in JIT-GCv5-MS handles not only allocation requests but also explicit free requests from its corresponding application thread. Because the memory management mechanisms for small and large objects differ in GCv5-MS, we treat the two kinds differently.
Handling Small Objects. If a mutator mf receives a request to free a small object o, the slot occupied by o must belong to a mutator-local block of some mutator ma, where mf may not be ma; that is, o may not be thread-local. Suppose mf directly set the mark bits of o in the block header to the free status to reclaim the slot, and let the original allocation algorithm control the reuse. Then accesses by mf to the word containing the mark bits would need to be synchronized, because the word also contains other slots' mark bits, which may simultaneously be accessed by ma to handle an allocation request or by another mutator to handle another explicit free request. Thus allocation operations on the mutator-local blocks, which originally need no synchronization, would have to be synchronized, bringing more synchronization overhead. In order to avoid such synchronization, we introduce a Reclaimed Object Pool (ROP), shown in Fig. 6, for each mutator to collect its explicitly reclaimed object spaces. When mutator mf reclaims an object o, it does not modify the mark bits of o, but turns the object slot into a node of type Reclaimed Object Entry and inserts the node into the list of the same size in mf's local ROP. gc_alloc is modified to try to reuse the explicitly reclaimed object space immediately: it first searches its local ROP for free space of the requested size, and if there is none, it continues to allocate using the original strategy of GCv5-MS.
Handling Large Objects. Because all mutators share the FAP for allocating large objects and the synchronization on these operations cannot be neglected, we keep the original implementation of large object allocation. When mutator mf receives a request to free a large object o, it directly inserts the memory area occupied by o into the free list of the matching size in the FAP, so a subsequent object allocation of the same size will reuse the memory area. Note that explicit reclamation of large objects must be synchronized, and we cannot easily obtain the reuse rate of the reclaimed large object space.
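The per-mutator ROP fast path for small objects can be sketched in C as follows. The structure and function names are assumptions made for this illustration; they are not the actual JIT-GCv5-MS interfaces.

#include <stddef.h>

#define SIZE_CLASSES 254                 /* small-object sizes: 8B..1020B, step 4B */

typedef struct ReclaimedEntry {          /* built in place inside the freed slot */
    struct ReclaimedEntry *next;
} ReclaimedEntry;

typedef struct Mutator {
    ReclaimedEntry *rop[SIZE_CLASSES];   /* segregated, mutator-local ROP lists */
    /* ... local blocks and other per-mutator state ... */
} Mutator;

static int size_class(unsigned size) { return (int)((size - 8) / 4); }

/* explicit free: no locking and no change to the block's mark bits; the
   freed slot itself becomes a list node in the local ROP */
static void rop_free(Mutator *m, void *slot, unsigned size) {
    ReclaimedEntry *e = (ReclaimedEntry *)slot;
    int c = size_class(size);
    e->next   = m->rop[c];
    m->rop[c] = e;
}

/* allocation fast path: reuse an explicitly reclaimed slot of the requested
   size first; on failure the caller falls back to the original GCv5-MS path */
static void *rop_alloc(Mutator *m, unsigned size) {
    int c = size_class(size);
    ReclaimedEntry *e = m->rop[c];
    if (e != NULL) {
        m->rop[c] = e->next;
        return e;                        /* immediate reuse of the slot */
    }
    return NULL;
}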
5 Experimental Results
We have implemented the above work in DRLVM and evaluated it with Jolden and SPECjbb2005. The experiments were performed on a 2.1GHz AMD Athlon dual-core machine with 896MB of memory running Windows XP.
5.1 Effectiveness of the JIT-Assisted GC
First, we check whether the JIT-assisted GC frees still-reachable objects or frees dead objects at wrong program points. In order to perform this correctness validation, we modify the implementation of gc_free to mark the explicitly reclaimed object as un-usable and un-reusable. In addition, the pipeline performs many checks on the LIR, such as variable liveness checking. In this way, if a wrong free action were performed, the system would throw an exception at the next access to a freed object or at an access to a potentially undefined variable. The experiments show that no such exceptions or errors occur at compile time or runtime. Table 2 presents the statistics on allocation, free and reuse for our JIT-assisted GC system at the default VM heap size of 256MB. The first four programs are from Jolden; the JIT-assisted GC explicitly frees 66% of all objects on average and up to 96% in Jolden. We find that the free instructions inserted in loops or recursive methods bring considerable benefit, and these explicitly reclaimed object spaces can be reused easily because the same allocation site is executed many times; e.g., Health reclaims 14MB and almost all of that space comes from such free instructions.
Table 2. Memory freed and reused by JIT-assisted GC
Application  | Total Alloc Mem | Free Mem on free(x) | Free Mem on free(x.f) | Total Free Mem | Total Reuse Mem | %Free Mem | %Reuse Mem
BH           | 67MB            | 14MB                | 46MB                  | 60MB           | 60MB            | 90%       | 100%
Health       | 60MB            | 14MB                | 0B                    | 14MB           | 14MB            | 23%       | 100%
Power        | 24MB            | 23MB                | 100B                  | 23MB           | 23MB            | 96%       | 100%
TSP          | 51MB            | 28MB                | 88B                   | 28MB           | 27MB            | 55%       | 96%
SPECjbb2005  | 1419MB          | 104MB               | 0B                    | 104MB          | 104MB           | 7%        | 100%
The last column of the table shows the reuse ratio of the explicitly reclaimed memory. For the programs in Jolden, the ratios are high and the explicitly reclaimed objects are all small objects; this illustrates that JIT-GCv5-MS can reuse almost all of these small objects. The reuse ratio of SPECjbb2005 is relatively low because the system reclaims many large objects, and the system does not collect statistics on large object space reuse, in order not to increase the synchronization cost.
5.2 Time Cost
Table 3 presents the statistics on the time cost of our object lifetime analysis and transformation pass together with the total compilation time; the pass costs less than 10% of the total compilation time.
Table 3. Analysis time and total compilation time
Application  | Objlife Time | Total Comp. Time | %Objlife
BH           | 23ms         | 537ms            | 4.3%
Health       | 14ms         | 309ms            | 4.5%
Power        | 12ms         | 332ms            | 3.6%
TSP          | 11ms         | 207ms            | 5.3%
SPECjbb2005  | 738ms        | 19011ms          | 3.9%
5.3 Performance Improvement of JIT-Assisted GC
To evaluate the performance impact, we compared the GC execution times of the benchmark programs. Fig. 7 presents the GC execution time comparison for the programs in Jolden. The x-axis is the heap size and the y-axis is the GC execution time. We can see that the GC execution time of JIT-GCv5-MS is less than that of GCv5-MS. As the heap size increases, the performance improvement becomes smaller: because the numbers of explicitly reclaimed objects and of reused objects are fixed, a larger heap size relatively diminishes the improvement, as seen for Health and TSP in Fig. 7. For BH and Power, even when the heap size is set to the minimum of 16MB, the GC execution time of JIT-GCv5-MS is zero, since more than 90% of the allocated space can be explicitly reclaimed.
Fig. 7. Performance comparison of 4 programs in Jolden
Fig. 8 shows the throughput comparison of SPECjbb2005 with and without the JIT-assisted GC optimization. It lists the throughputs of 6 groups of experiments. The dark columns and the first row in the data table give the throughput without the JIT-assisted GC; the light columns and the second row in the data table give the throughput with the JIT-assisted GC. The third row in the table is the improvement ratio, which is about 1.3∼2.9%.
Fig. 8. The throughput comparison of SPECjbb2005
6 Related Work and Conclusions
Guyer et al. propose a free-me analysis [12], which is closest to our work. They combine a lightweight pointer analysis with liveness information to detect when short-lived objects die, and insert calls to free the dead objects. However, their method cannot identify the lifetime of objects referenced by fields, due to its field-insensitivity. Cherem et al. present a uniqueness inference that can free objects with a unique reference in the whole heap through free instructions and destructors [13, 14]; the work needs to modify libraries to add destructors, which is complex and difficult, and it is not well suited to a system built in a virtual machine, because the latter needs to execute applications accompanied by just-in-time compilation. Neither of these works supports the reuse of the explicitly reclaimed space. Lee and Yi's analysis inserts free instructions only for immediate reuse, i.e., before an allocation of the same size [15]. Marinov et al. present Object Equality Profiling (OEP) [16] to discover opportunities for replacing a set of equivalent object instances with a single representative object. Gheorghioiu et al. present an inter-procedural and compositional algorithm for finding pairs of compatible allocation sites [17], which have the property that no object allocated at one site is live at the same time as any object allocated at the other site. These works focus on object merging and reuse only for objects of the same size with homogeneous lifetimes, and only on the compiler side. Our work can identify some short-lived objects without being limited to method scope or other special patterns as in [15], and it also detects objects referenced only by the fields of the identified dead objects. The PEG-based analysis seems similar to [7]; however, the definition of the escape lattice and the rules for building and transforming the PEG are very different. In addition, our work not only frees the identified dead objects, but also tries to reuse them immediately. Although our current work in GCv5 targets the mark-sweep algorithm, we can easily extend it to other algorithms in GCv5.
Based on the JIT-assisted GC framework, we can explore more optimizations of memory management. We are analyzing the benefit of each free instruction inserted by the JIT and the memory utilization of each allocation site in loops or recursive methods by developing a logging system that cooperates with the JIT, VMCore and GC. Based on the analysis results, we expect to find more opportunities for memory management optimization.
References
1. Jones, R., Lins, R.: Garbage Collection: Algorithms for Automatic Dynamic Memory Management. John Wiley & Sons, Chichester (1996)
2. SPECjbb2005 benchmark (2005), http://www.spec.org/jbb2005/
3. Kero, M., Nordlander, J., Lundgren, P.: A correct and useful incremental copying garbage collector. In: Proc. 6th Int'l Symp. on Memory Management, pp. 129–140. ACM Press, New York (October 2007)
4. Blackburn, S., McKinley, K.: Immix garbage collection: mutator locality, fast collection, and space efficiency. In: Proc. 2008 ACM Conf. on Prog. Lang. Design and Impl., pp. 22–32. ACM Press, New York (June 2008)
5. Whaley, J., Rinard, M.: Compositional pointer and escape analysis for Java programs. ACM SIGPLAN Notices 34(10), 187–206 (1999)
6. Gay, D., Steensgaard, B.: Fast escape analysis and stack allocation for object-based programs. In: Watt, D.A. (ed.) CC 2000. LNCS, vol. 1781, pp. 82–93. Springer, Heidelberg (2000)
7. Choi, J.D., Gupta, M., Serrano, M.J., Sreedhar, V.C., Midkiff, S.P.: Stack allocation and synchronization optimizations for Java using escape analysis. ACM Trans. on Programming Languages and Systems 25(6), 876–910 (2003)
8. Gay, D.E., Aiken, A.: Language support for regions. In: Proc. 2001 ACM Conf. on Prog. Lang. Design and Impl., pp. 70–80. ACM Press, New York (June 2001)
9. Grossman, D., Morrisett, G., Jim, T., Hicks, M., Wang, Y., Cheney, J.: Region-based memory management in Cyclone. In: Proc. 2002 ACM Conf. on Prog. Lang. Design and Impl., pp. 282–293. ACM Press, New York (June 2002)
10. Salagnac, G., Yovine, S., Garbervetsky, D.: Fast escape analysis for region-based memory management. In: Proc. 1st Int'l Workshop on Abstract Interpretation for Object-Oriented Languages. ENTCS, vol. 141, pp. 99–110. Elsevier, Amsterdam (January 2005)
11. Stefan, A., Craciun, F., Chin, W.N.: A flow-sensitive region inference for CLI. In: Ramalingam, G. (ed.) APLAS 2008. LNCS, vol. 5356, pp. 19–35. Springer, Heidelberg (2008)
12. Guyer, S.Z., McKinley, K.S., Frampton, D.: Free-Me: a static analysis for automatic individual object reclamation. In: Proc. 2006 ACM Conf. on Prog. Lang. Design and Impl., pp. 364–375. ACM Press, New York (June 2006)
13. Cherem, S., Rugina, R.: Compile-time deallocation of individual objects. In: Proc. 5th Int'l Symp. on Memory Management, pp. 138–149. ACM Press, New York (June 2006)
14. Cherem, S., Rugina, R.: Uniqueness inference for compile-time object deallocation. In: Proc. 6th Int'l Symp. on Memory Management, pp. 117–128. ACM Press, New York (October 2007)
15. Lee, O., Yi, K.: Experiments on the effectiveness of an automatic insertion of memory reuses into ML-like programs, pp. 97–108. ACM Press, New York (October 2004)
16. Marinov, D., O'Callahan, R.: Object equality profiling. In: Proc. 18th ACM SIGPLAN Conf. on Object-Oriented Prog. Systems, Lang., and Applications, pp. 313–325. ACM Press, New York (October 2003)
17. Gheorghioiu, O., Salcianu, A., Rinard, M.: Interprocedural compatibility analysis for static object preallocation. In: Proc. 30th ACM Symp. on Principles of Prog. Lang., pp. 273–284. ACM Press, New York (January 2003)
18. Apache Harmony DRLVM (2006), http://harmony.apache.org/subcomponents/drlvm/index.html
19. Apache Harmony GCv5 (2008), http://harmony.apache.org/subcomponents/drlvm/gc-v5.html
Optimization of Triangular Matrix Functions in BLAS Library on Loongson2F
Yun Xu 1,2, Mingzhi Shao 1,2, and Da Teng 1,2
1 School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
[email protected], [email protected], [email protected]
2 Anhui Province Key Laboratory of High Performance Computing, Hefei, China
Abstract. BLAS (Basic Linear Algebra Subprograms) plays a very important role in scientific computing and engineering applications. ATLAS is often recommended as a way to generate an optimized BLAS library. Based on ATLAS, this paper optimizes the algorithms of the triangular matrix functions for the 750 MHz Loongson 2F processor architecture. Using loop unrolling, instruction scheduling and data pre-fetching techniques, both computing time and memory access delay are reduced, and thus the performance of the functions is improved. Experimental results indicate that these optimization techniques effectively reduce the running time of the functions. After optimization, the double-precision TRSM reaches 1300 Mflops, while the single-precision version reaches 1800 Mflops. Compared with ATLAS, the performance of the TRSM function is improved by 50% to 60%, and even by 100% to 200% for small-scale inputs. Keywords: BLAS; ATLAS; triangular matrix function; loop unrolling; data pre-fetching.
1 Introduction
In contemporary scientific and engineering computing, most of the running time is spent in basic linear algebra functions. A lot of software related to matrix computing invokes functions in BLAS [1] (Basic Linear Algebra Subprograms). As a consequence, it is imperative to optimize the BLAS library for a specific machine to fully utilize its hardware resources. KD-50-I is a high performance computer that employs China's Loongson 2F superscalar CPU, which has the advantages of low power, low cost and high integration. Our benchmark for the high performance computer KD-50-I is HPL (High Performance Linpack), which is implemented by invoking functions in the BLAS library. Therefore, the efficiency of the functions in the BLAS library directly affects the performance of the KD-50-I system.
C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 35–45, 2010. © IFIP International Federation for Information Processing 2010
The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. In this paper we mainly aim at the optimization of the functions that compute triangular-matrix and vector operations in Level 2, and the TRMM and TRSM functions that implement triangular-matrix and matrix operations in Level 3. These functions take up about one third of the BLAS library. In the optimization of linear algebra libraries, contemporary research focuses on the algorithmic level [2, 3]. In the time it takes the CPU to read one byte from memory, it can execute hundreds of instructions. Consequently, the bottleneck of optimizing these functions is not computing time but memory access delay. In order to reduce memory access delay, Chowdhury [4] proposed a method that extends the cache-oblivious framework to solve the Gaussian Elimination Paradigm (GEP); Low [5] provided a high-efficiency blocking algorithm for functions in Level 3. ATLAS [6, 7] (Automatically Tuned Linear Algebra Software) is one of the matrix packages [8, 9]. ATLAS is portable BLAS software which first tests hardware parameters and then optimizes some dense-matrix functions using basic optimizing techniques. ATLAS can optimize basic BLAS functions automatically based on the cache capacity and memory access delay parameters that it has measured. However, there still exist unknown parameters of specific architectures, e.g., the pipeline structure. Thus, there is room for optimization of the code that ATLAS generates. Based on ATLAS, we further optimize the triangular-matrix functions in BLAS at the algorithmic level to reduce access delay and to improve the performance of BLAS, using general optimizing techniques (such as matrix blocking and loop unrolling) and optimizing techniques specific to Loongson 2F (such as the multiply-add instruction, instruction scheduling, and data pre-fetching).
2 Triangular-Matrix Functions and ATLAS Methods
There are 8 triangular-matrix functions in BLAS; however, here we only take the TRSM function to illustrate the optimization methods and results.
2.1 TRSM and Symbols
TRSM implements the multiplication of the inverse of a triangular matrix A and a matrix B, as formula (1) illustrates:

    B ← α·op(A⁻¹)·B    or    B ← α·B·op(A⁻¹)                   (1)
where α (ALPHA, a scalar) is a scaling factor, B is an M-by-N matrix, A is an upper (or lower) triangular (or unit triangular) matrix, and op(A) can be A, the transpose of A, or the conjugate transpose of A. If op(A) is on the left of B (left multiplication), A is M-by-M; if not, A is N-by-N. The declaration of the TRSM function is xTRSM(ORDER, SIDE, UPLO, TRANS, DIAG, M, N, ALPHA, A, LDA, B, LDB), where x represents s, d, c or z, which
respectively stand for the single precision float, double precision float, single precision complex, and double precision complex data types.
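For reference, the left-multiplication, upper-triangular, non-unit, double-precision case described above corresponds to the following call through the standard CBLAS interface (assuming a cblas.h header is available; this is only an illustration of the parameters, not code from the paper):

#include <cblas.h>

/* Solve op(A) * X = alpha * B for X, overwriting B; A is M x M upper
   triangular, B is M x N, both column-major. */
void example_dtrsm(int M, int N, double alpha,
                   const double *A, int lda, double *B, int ldb) {
    cblas_dtrsm(CblasColMajor, CblasLeft, CblasUpper,
                CblasNoTrans, CblasNonUnit,
                M, N, alpha, A, lda, B, ldb);
}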
2.2 ATLAS Methods
For TRSM, ATLAS employs the solution method for linear equation systems. ATLAS applies some basic optimizations to TRSM as follows:
Matrix Blocking. ATLAS optimizes TRSM by blocking, setting the block size for real numbers to 4 and the block size for complex numbers to 8, the same as the loop unrolling coefficients. After matrix blocking, the triangular matrix A is divided into several smaller rectangular matrices and triangular matrices, where the smaller rectangular matrices can be handled by invoking the GEMM function and the smaller triangular matrices can be solved directly.
Copy and Partial Matrix Transpose. Through the copy and partial matrix transpose technique, ATLAS puts a matrix into a transposed or non-transposed form, which changes the data storage order and further improves the performance of the functions.
Loop Unrolling. ATLAS provides the trsmKL and trsmKR functions that operate on real numbers, and the CtrsmK function that operates on complex numbers, where trsmKL and trsmKR unroll a loop by 8 × 1 × 1 and CtrsmK fully unrolls the two inner loops.
3 General Optimizing Techniques
3.1 Adjusting Block Size
Matrix blocking is a widely applied optimization technique for enhancing storage locality. It shrinks the local data sets to avoid cache conflicts. The matrix blocking algorithm partitions a matrix into sub-blocks so that data already in the cache can be reused. We adjust the block size to a proper value so that each data set can be stored in the cache, which reduces the extra cost of blocking and keeps conflicts relatively low. The left-multiplication form of dTRSM exemplifies the specific steps of matrix blocking. As Fig. 1 illustrates, the triangular matrix A(M×M) is split into several smaller triangular matrices A(RB×RB) (grey region of matrix A in Fig. 1) and several smaller rectangular matrices A(RB×M′) (white region of matrix A in Fig. 1, where M′ is a variable and RB ≤ M′ ≤ M), so that each such block of A can be fully stored in the L1 cache. For each A(RB×RB), B(M×N) is split into several smaller rectangular matrices B(RB×N); the parts corresponding to the triangular blocks A(RB×RB) are solved by TRSM, and the parts corresponding to the rectangular blocks A(RB×M′) are handled by GEMM.
Fig. 1. Using matrix block technique to the triangular matrix A and the matrix B
As a consequence, optimization is concentrated on the multiplication of the triangular matrix A(RB×RB) and B(RB×N). Both A and B can then be read from the cache instead of memory, which avoids conflicts and enhances the performance of TRSM.
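The following C sketch outlines this blocking scheme for the Left/Upper/NoTranspose/NonUnit case of dTRSM. The routine names and their reference bodies are placeholders rather than the actual ATLAS code, and RB would be tuned per platform (Section 4 uses RB = 60 on Loongson 2F).

#include <stddef.h>

enum { RB = 60 };                       /* block size; 60 on Loongson 2F [10] */

/* reference kernel: solve the small upper-triangular system A*X = B in
   place, column-major, non-unit diagonal (an optimized kernel would be
   unrolled and scheduled as in Section 4) */
static void trsm_kernel(int m, int n, const double *A, int lda,
                        double *B, int ldb) {
    for (int j = 0; j < n; j++)
        for (int k = m - 1; k >= 0; k--) {
            double x = B[k + (size_t)j * ldb] / A[k + (size_t)k * lda];
            B[k + (size_t)j * ldb] = x;
            for (int i = 0; i < k; i++)
                B[i + (size_t)j * ldb] -= x * A[i + (size_t)k * lda];
        }
}

/* reference update: B(0:m,0:n) -= A(0:m,0:kdim) * X(0:kdim,0:n); in the
   real library this call is delegated to the optimized GEMM */
static void gemm_update(int m, int n, int kdim,
                        const double *A, int lda,
                        const double *X, int ldx,
                        double *B, int ldb) {
    for (int j = 0; j < n; j++)
        for (int p = 0; p < kdim; p++) {
            double x = X[p + (size_t)j * ldx];
            for (int i = 0; i < m; i++)
                B[i + (size_t)j * ldb] -= A[i + (size_t)p * lda] * x;
        }
}

/* blocked driver: op(A) * X = alpha * B, X overwrites B */
void blocked_dtrsm_LUNN(int M, int N, double alpha,
                        const double *A, int lda, double *B, int ldb) {
    if (alpha != 1.0)                               /* apply alpha once */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < M; i++)
                B[i + (size_t)j * ldb] *= alpha;

    /* walk the RB x RB diagonal blocks from bottom-right to top-left */
    for (int k = ((M - 1) / RB) * RB; k >= 0; k -= RB) {
        int mb = (M - k < RB) ? (M - k) : RB;
        int trailing = M - (k + mb);                /* rows already solved */
        if (trailing > 0)                           /* B_k -= A(k, k+mb:) * X */
            gemm_update(mb, N, trailing,
                        &A[k + (size_t)(k + mb) * lda], lda,
                        &B[k + mb], ldb, &B[k], ldb);
        trsm_kernel(mb, N, &A[k + (size_t)k * lda], lda, &B[k], ldb);
    }
}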
3.2 Setting Coefficients of Loop Unrolling
Loop unrolling is a common compiler optimization technique, employed to reduce both loop-variable operations and branch instructions. Besides, unrolling the outer loops of a multi-level loop nest makes certain data reusable: we can keep these data in registers so that they are read directly from registers instead of memory, which lowers the required communication bandwidth between registers and cache as well as between cache and memory. Next, we analyze how the unrolling coefficients affect the performance of the functions, and then specific coefficients are chosen for the Loongson 2F architecture. We define the function ATL_dreftrsmLUNN (ref for short) that implements the upper-triangular, non-transposed, non-unit-matrix form of the blocked dTRSM. As Algorithm 1 illustrates, this function shows how the coefficients affect function performance. There are 3 loop levels in Algorithm 1, denoted by R, S, and T. The performance of the ref function is limited by memory access. The number of memory accesses is (M² + M)N when the multiplication of the M × M triangular matrix A and the M × N rectangular matrix B is performed. If level T is unrolled α times, the numbers of memory accesses to A and B are respectively N·M(M+1)/2 and M·N(M+1)/(2α); if level S is unrolled β times, the numbers of accesses to A and B are respectively M·N(M+1)/(2β) and M·N(M+1)/2. So the total number of memory accesses when level T is unrolled α times and level S is unrolled β times is M·N(M+1)/(2α) + M·N(M+1)/(2β). The computational work of dTRSM is (M² + 2M)N, and the memory access speed is denoted by L (MB/s), so the theoretical upper limit of the computing speed is:
    speed = (M² + 2M)N·L / [M·N(M+1)(α+β)/(2αβ)] = 2αβ(M+2)L / [(α+β)(M+1)] ≈ 2αβL/(α+β)          (2)
When dealing with large-scale data, where the efficiency of ATL_dreftrsmLUNN is limited by memory access, increasing the values of α and β raises this upper limit on the computing speed.
3.3 Other Optimizing Techniques
Besides transforming division into multiplication for ref, we have done the same for the complex data type functions, because a multiplication takes only one cycle while a division takes tens of cycles. Taking the left-multiplication form of zTRSM as an example, computing each column of the complex matrix B requires 2M division operations, so the total number is 2MN. An array is used to hold, for each diagonal element of the triangular matrix, the values r/(r² + i²) and i/(r² + i²), where r and i stand for the real and imaginary parts, respectively. It is then straightforward to replace the former division operations by multiplications, after which only 2M division operations and 2MN multiplication operations remain, and the execution time is shortened. Moreover, after unrolling the loops in zTRSM and cTRSM, we also expand the Mmls multiply-subtract operation for complex numbers used in the computational core, separating the real and imaginary parts, so that further optimization with the Loongson-2F-specific techniques becomes easier.
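A minimal C sketch of this division-to-multiplication rewrite is shown below. The interleaved (real, imaginary) column-major layout and the function names are assumptions made for the illustration, not the library's actual code.

#include <stdlib.h>

typedef struct { double *re_inv; double *im_inv; } DiagRecip;

/* Precompute r/(r^2+i^2) and i/(r^2+i^2) for each diagonal element of the
   M x M complex triangular matrix: 2M divisions in total, done once. */
static DiagRecip precompute_diag(const double *A, int lda, int M) {
    DiagRecip d;
    d.re_inv = malloc(M * sizeof(double));
    d.im_inv = malloc(M * sizeof(double));
    for (int k = 0; k < M; k++) {
        double r = A[2 * (k + (size_t)k * lda)];      /* real part of A(k,k) */
        double i = A[2 * (k + (size_t)k * lda) + 1];  /* imag part of A(k,k) */
        double s = r * r + i * i;
        d.re_inv[k] = r / s;
        d.im_inv[k] = i / s;
    }
    return d;
}

/* Later, dividing an element b = br + bi*I by A(k,k) becomes a complex
   multiplication by the stored reciprocal: multiplications only, no divide. */
static void div_by_diag(double *br, double *bi, const DiagRecip *d, int k) {
    double xr = *br, xi = *bi;
    *br = xr * d->re_inv[k] + xi * d->im_inv[k];
    *bi = xi * d->re_inv[k] - xr * d->im_inv[k];
}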
4 Optimization Techniques Based on Loongson 2F Architecture
Loongson is a family of general-purpose MIPS-compatible CPUs developed at the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS) in China. KD-50-I is a supercomputer with a total of more than 330 Loongson 2F CPUs. Loongson's instruction set is 64-bit MIPS III compatible, and the processor has separate 64KB/64KB L1 instruction and data caches. Following the literature [10], RB is set to 60 when we block the matrices in the dTRSM function in Section 3.1. There are 32 registers in Loongson 2F, so theoretically the loop unrolling for ATL_dreftrsmLUNN can be set to 4 × 4 × 2, that is α = β = 4, to obtain a good result of speed = 4L. The following subsections describe the other techniques that are specific to the Loongson 2F architecture.
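Plugging α = β = 4 into the approximation in formula (2) gives the figure quoted above:

    speed ≈ 2αβL/(α+β) = (2·4·4·L)/(4+4) = 4L.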
4.1 Multiply-Add Instruction
Because a traditional multiply instruction followed by an add instruction forms a RAW (read-after-write) dependence, it is often necessary to place other instructions between them to reduce pipeline stalls. However, Loongson 2F provides a fused multiply-add (and multiply-subtract) instruction, with which we can combine the multiplication and addition to improve the program.
4.2 Instruction Scheduling
Another technique is to adjust the order of instructions to avoid pipeline stalls. The Loongson 2F, whose dynamic pipeline is similar to that of the Pentium Pro, has 5 execution units: 2 ALUs, 2 FPUs, and 1 address generation unit (AGU). Consequently, in each issue group, one load instruction should be accompanied by two floating-point instructions and one fixed-point instruction, so that the 4-way superscalar issue width is fully used, which helps raise the IPC (instructions per clock).
4.3 Data Pre-fetching
The prefetch instruction of the Loongson instruction set can block the reorder queue, so we employ a branch-prediction-based technique for data pre-fetching instead. After unrolling the loops in Algorithm 1 by 4 × 4 × 2, we apply data pre-fetching and instruction scheduling so that the instructions in each row fit the 4-way superscalar issue width. The code of the innermost loop is as follows:
Algorithm 2. Using data pre-fetching in loop unrolling
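The C sketch below conveys the shape of such a loop. It is not the paper's actual Algorithm 2 listing: it is reduced from 4 × 4 × 2 to 2 × 2 × 2 for brevity, and the pointer and register names simply mirror those used in the surrounding text.

static void micro_kernel_2x2(const double *pA0, const double *pA1,
                             const double *pB0, const double *pB1,
                             int K, int incA, int incB, double *C) {
    double rC00 = C[0], rC01 = C[1], rC10 = C[2], rC11 = C[3];
    double rA0 = *pA0, rA1 = *pA1, rB0 = *pB0, rB1 = *pB1;
    double ra0, ra1, rb0, rb1;
    const int incA2 = 2 * incA, incB2 = 2 * incB;

    for (int k = 0; k < K; k += 2) {          /* K assumed even */
        /* first half: compute with the rA/rB group, load the ra/rb group
           that the second half will use */
        ra0 = pA0[incA];  rC00 -= rA0 * rB0;  rC01 -= rA0 * rB1;
        ra1 = pA1[incA];  rC10 -= rA1 * rB0;  rC11 -= rA1 * rB1;
        rb0 = pB0[incB];  pA0 += incA2;       pA1 += incA2;
        rb1 = pB1[incB];  pB0 += incB2;       pB1 += incB2;

        /* second half: compute with the ra/rb group, load the rA/rB group
           one iteration ahead (the final preload reads one element past the
           K-th, which in the real kernel stays inside the enclosing matrix) */
        rA0 = *pA0;       rC00 -= ra0 * rb0;  rC01 -= ra0 * rb1;
        rA1 = *pA1;       rC10 -= ra1 * rb0;  rC11 -= ra1 * rb1;
        rB0 = *pB0;
        rB1 = *pB1;
    }
    C[0] = rC00; C[1] = rC01; C[2] = rC10; C[3] = rC11;
}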
In Algorithm 2, incA and incB stand for the distance between two consecutive elements of matrix A and of matrix B, respectively. The load instructions in the first part acquire the data required by the second part, and the data acquired in the second part are needed in the first part of the next iteration. Except for rA0, rB0, rA1, rB1, rA2, rB2, rA3 and rB3, each instruction can obtain the data it needs at least one cycle ahead; that is, the data is pre-fetched. By renaming registers and using them in rotation, dependences among instructions are reduced. For example, as Algorithm 2 indicates, we employ rA0, rA1, rA2, rA3 and ra0, ra1, ra2, ra3 in rotation. The following two lines show that rA0 and ra0 fetch data in turn and are independent of each other, so that the data is successfully pre-fetched:
    ra0 = pA0[incA];  rC00 -= rA0 * rB0;  pA0 += incA2;
    rA0 = *pA0;       rC00 -= ra0 * rb0;
5 Experimental Results and Discussion
Our experiments combine repeated testing of a single case with cyclic testing over a group of cases. Specifically, every function is executed repeatedly at various data scales, and the average speed is taken as the final result. We use Mflops (millions of floating-point operations per second) as the performance metric. For convenience, the parameters M, N, LDA and LDB of the TRSM function are all set to the same value, length. The compiler configuration includes enabling multiply-subtract instructions (-DAdd_ -DStringSunStyle), the Linux operating system (-DATL_OS_Linux), not saving the frame pointer at function calls (-fomit-frame-pointer), optimization level -O3, and unrolling all loops (-funroll-all-loops).
5.1 Optimization Results of dTRSM
The outcome of the dTRSM optimization is discussed from two aspects: the influence of each technique on dTRSM at small data scales, and the comprehensive effect of all techniques at large data scales. Because at small data scales the computation is done entirely by TRSM itself, the effect of the optimizations is reflected directly in the performance of TRSM. Fig. 2 shows how each technique influences performance at small data scales.
Fig. 2. Comparison of different optimization methods in terms of dTRSM function performance
Fig. 3. Performance of optimized function dTRSM
Fig. 4. Performance of optimized functions xTRSM
In Fig. 2, Test1 stands for invoking dTRSM directly; Test2 for using the loop unrolling technique; Test3 for loop unrolling and data pre-fetching; Test4 for loop unrolling, data pre-fetching, and division replaced by multiplication. Test2 outperforms Test1 by almost 3 times, which is close to the theoretical value; comparing Test2 and Test3, it is clear that the rate is doubled by the instruction scheduling and data pre-fetching techniques on top of loop unrolling; comparing Test3 and Test4, the execution time is further shortened by replacing division with multiplication. Thus, loop unrolling, data pre-fetching, instruction scheduling, and replacing division with multiplication all accelerate execution. After the Test4 optimization, the speed of dTRSM reaches 1723.45 Mflops when the blocking threshold is 60; it outperforms Algorithm 1 by a factor of 7.37 and outperforms ATLAS by a factor of 1.5. When dealing with large-scale data, the computation done directly by TRSM takes up a rather low ratio of the total work, so we only discuss the comprehensive result of the optimization; "opt" in Fig. 3 represents this result. The speed of dTRSM in the steady state is 1300 Mflops, which outperforms the former function by 60%.
5.2 Performance of Other Optimized Functions in TRSM
For the double-precision complex function zTRSM, we determined the proper optimization techniques by testing how the techniques work at small data scales. With only loop unrolling and replacing division by multiplication, the function does not achieve high performance: the speed is only 573.56 Mflops when the blocking threshold is 24. So we expand the core Mmls instruction (the multiply-subtract instruction for complex numbers) to compute the real part and the imaginary part separately, and then apply instruction scheduling and data pre-fetching. Finally, the rate reaches 1211.57 Mflops, which outperforms ATLAS by 200%.
For the single-precision function sTRSM, the techniques used are the same as those for dTRSM except that the blocking threshold is set to 72. The optimization methods for cTRSM are similar to those for zTRSM except that the blocking threshold is set to 36. Fig. 4 presents the performance of the optimized sTRSM, zTRSM and cTRSM. In Fig. 4, s_atlas, z_atlas and c_atlas represent the performance of sTRSM, zTRSM and cTRSM as optimized by ATLAS, and s_opt, z_opt and c_opt stand for the performance of sTRSM, zTRSM and cTRSM optimized by us. It is obvious in Fig. 4 that the curves s_opt, z_opt and c_opt are relatively smooth, and the final rates of sTRSM, zTRSM and cTRSM are respectively 1800, 1400 and 1800 Mflops, which outperforms ATLAS by at least 70%.
6 Conclusion and Further Research
Every high performance computer needs a BLAS library optimized for its specific architecture in order to fully utilize its hardware resources. In this paper, we have optimized the triangular matrix functions at the algorithmic level, based on the optimization of the GEMM function and the Loongson 2F architecture. We have employed optimization techniques such as loop unrolling, data pre-fetching and instruction scheduling to raise the performance of the functions. The rates of the double-precision functions dTRSM and zTRSM reach 1300 Mflops, and the rates of the single-precision functions sTRSM and cTRSM reach 1800 Mflops. In comparison with ATLAS, our optimization improves performance by 50-60%, and even by 100-200% when dealing with small-scale data. At present, our research is concentrated on the optimization of BLAS in a single-core environment. In the future, we will start a new program researching the optimization of BLAS for parallelism when the multi-core CPU, Loongson 3, is put into use.
Acknowledgment We thank Bo Chen and Haitao Jiang, who provided many helpful suggestions. This paper is supported by the Key Subproject of the National High Technology Research and Development Program of China, under Grants No. 2008AA010902 and No. 2009AA01A134.
References
1. Lawson, C., Hanson, R., Kincaid, D., Krogh, F.: Basic Linear Algebra Subprograms for FORTRAN usage. ACM Transactions on Mathematical Software 5(3), 308–323 (1979)
2. Dumas, J.G., Gautier, T., Pernet, C.: Finite field linear algebra subroutines. In: Proceedings of the 2002 International Symposium on Symbolic and Algebraic Computation (2002)
3. Elmroth, E., Gustavson, F., Jonsson, I., Kagstrom, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review (2004)
4. Chowdhury, R.A., Ramachandran, V.: The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation. In: Proceedings of the Nineteenth Annual ACM Symposium on Algorithms and Computation Theory, pp. 71–80 (2007)
5. Low, T.M., Robert, A., et al.: API for manipulating matrices stored by blocks. Department of Computer Sciences, The University of Texas at Austin (2004), http://www.cs.utexas.edu/users/flame/pubs/flash.ps
6. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimization of software and the ATLAS project. Parallel Computing 27, 3–35 (2001)
7. Demmel, J., Dongarra, J., Eijkhout, V., Fuentes, E., et al.: Self-adapting linear algebra algorithms and software. Proceedings of the IEEE, Special Issue on Program Generation, Optimization, and Adaptation 93(2) (2005)
8. Goto, K., van de Geijn, R.: On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, University of Texas at Austin (2002)
9. Koenker, R., Ng, P.: SparseM: a sparse matrix package for R. Journal of Statistical Software 8(6) (2003)
10. Gu, N.J., Li, K., et al.: Optimization for BLAS on Loongson 2F architecture. Journal of University of Science and Technology of China 38(7) (2008)
Exposing Tunable Parameters in Multi-threaded Numerical Code

Apan Qasem (Texas State University, [email protected]), Jichi Guo, Faizur Rahman, and Qing Yi (University of Texas at San Antonio, {jguo,srahman,qingyi}@cs.utsa.edu)

This research is funded by the National Science Foundation under Grant No. 0833203 and No. 0747357 and by the Department of Energy under Grant No. DE-SC001770.
Abstract. Achieving high performance on today’s architectures requires careful orchestration of many optimization parameters. In particular, the presence of shared-caches on multicore architectures makes it necessary to consider, in concert, issues related to both parallelism and data locality. This paper presents a systematic and extensive exploration of the combined search space of transformation parameters that affect both parallelism and data locality in multi-threaded numerical applications. We characterize the nature of the complex interaction between blocking, problem decomposition and selection of loops for parallelism. We identify key parameters for tuning and provide an automatic mechanism for exposing these parameters to a search tool. A series of experiments on two scientific benchmarks illustrates the non-orthogonality of the transformation search space and reiterates the need for integrated transformation heuristics for achieving high-performance on current multicore architectures. Keywords: Autotuning, memory hierarchy, optimization, parallelism.
1 Introduction
The emergence of chip multiprocessor systems has greatly increased the performance potential of modern computer systems. However, much of the responsibility in exploiting the on-chip parallelism lies with system software like the compiler and the operating system. The complexity of modern architectures implies that compilers, in addition to analyzing the code for profitability, need to account for a large number of architectural parameters to achieve a high fraction of peak performance. Since information about many of these hardware parameters is not readily available, achieving high performance on modern architectures is an extremely challenging task for a compiler. The problem of achieving portable high-performance is often more pronounced for numerical code in scientific domains. Scientific applications are characterized by high degrees of temporal reuse and large working sets that often do not fit in
the higher-level caches. These codes also involve a significant amount of floating-point computation and contain loop nests that are amenable to parallelization at one or more levels. Thus, to achieve high-performance for numerical code, the compiler needs to find ways to extract parallelism and apply optimizations that exploit data reuse. This task of exploiting locality and parallelism is further complicated by the presence of shared-caches on current multicore platforms [1]. A shared-cache poses an inherent trade-off between data locality and parallelism. On one hand, any parallel decomposition of the application inevitably influences data access patterns in each concurrent thread. On the other hand, transformations for improving locality often impose constraints on how much parallelism can be extracted. For example, if a data-parallel decomposition of an application creates a working set of size WS_i for thread t_i, then to improve data locality, the compiler needs to ensure that \sum_{i=0}^{k} WS_i < CS, where CS is the size of the shared cache. To satisfy this constraint, the compiler may consider several options including tiling each WS_i, finding a suitable schedule for t_0 through t_k, or reconfiguring the decomposition itself. Each approach will not only impact the data reuse of an individual thread but also the synchronization cost and task granularity. Thus, when parallelizing an application for multicore architectures, it is important to find the right balance between data locality and parallelism, which involves considering a large number of code transformations and parameters. This combinatorial explosion of performance-influencing parameters has made static compiler heuristics largely ineffective in delivering portable performance on current architectures. In response to this daunting challenge, several research groups have proposed methods and techniques for automatic performance tuning [2–5]. In an autotuning framework, a code is analyzed, alternate code variants are generated with tuning parameters and then a heuristic search is performed based on execution time feedback to obtain an implementation that yields optimal or near-optimal performance for the target platform. Many of these tuning efforts have achieved reasonable success and have reduced the need for manual tuning of code in certain domains [5, 6]. However, one area where autotuning research has lagged is in considering a search space that covers both parallelism and data locality parameters for current multicore architectures. The few autotuning efforts that have considered tuning for parallelism have limited themselves to single-dimensional problem decomposition and have not considered the issue of data locality in concert [7]. There is one notable work that considers both parallelism granularity and data locality in the search space of stencil computations [3]. However, the exploration of the search space is done in an orthogonal manner and does not consider the interaction between search dimensions. This paper takes a systematic approach to characterize and explore the search space of transformation parameters that affect both data locality and parallelism in multi-threaded applications. We observe that the interaction between locality and parallelism can be captured by considering both the shape and size of problem decomposition. The shape of a problem decomposition can be expressed as a
combination of multiple blocking factors used in the code. This is the key insight that has driven this research. In this work, we identify code transformations and transformation parameters that can be used to control both the granularity of parallelism and the memory access patterns of concurrent threads. We establish a set of criteria to characterize the relationship of transformation parameters that affect both data locality and parallelism. Since multicore systems with shared caches give rise to both intra- and inter-core locality, we consider both types of reuse in constructing the initial search space. Additionally, we also incorporate the issue of false sharing within co-running threads. In terms of parallelism, we consider problem decomposition and thread creation at all levels of a given loop nest. We combine all of these parameters into one unified multi-dimensional search space. We use a transformation scripting language [8] to implement each optimization and expose the parameters to a search engine. To explore the search space in a non-orthogonal manner, we employ several multi-dimensional search methods, including direct search and simulated annealing. Analysis of our experimental results suggests that the shape of a problem decomposition does indeed have a significant impact on the performance of numerical kernels on multicore architectures. The main contributions of this paper include:
• identification of key transformation parameters for optimizing parallel numerical code on multicore architectures
• an automatic method of exposing these parameters for tuning
• a non-orthogonal exploration of a search space that includes parameters for exploiting both parallelism and data locality.
2 Related Work
2.1 Exploiting Parallelism and Data-Locality on CMPs
The dominance of multicore technology within the processor industry has led to a plethora of work in code improvement techniques for this platform. Software-based approaches have been proposed to create new parallel abstractions, extract more parallelism, exploit data locality in the shared memory hierarchy, improve thread schedules, and control synchronization delays. In our treatment of related research, we limit the discussion to work most relevant to our approach, namely strategies for exploiting parallelism and data locality. Many techniques for extracting parallelism and controlling granularity are described in the literature [9]. Recent work has focused on extracting fine-grained parallelism and exploring different models of parallelism such as pipelined parallelization. Thies et al. [10] describe a method for exploiting coarse-grain pipelined parallelism in C programs. They also develop a set of parallel programming primitives to support pipeline parallelism. They utilize the asynchronicity of short threads to hide the latency of memory accesses to improve performance. Papadopoulos et al. show that adding more execution threads is not beneficial and can be prohibitively difficult to implement for database applications [11].
The locality transformations described in this paper have been widely studied [9]. Loop blocking or tiling is the predominant transformation for exploiting temporal locality for numerical kernels [12, 13]. The use of unroll-and-jam to improve register reuse is also common in both commercial and research compilers. Loop fusion and array contraction have been used in conjunction to improve cache behavior and reduce storage requirements [14]. Loop alignment has been used as an enabling transformation with loop fusion and scalarization [9]. Loop skewing and time skewing serve as enabling transformations in strip-mining and parallelizing loops with carried dependences. Although the literature is replete with heuristics for selecting tile sizes and choosing unroll factors [12, 13], attempts at integrating all these transformations have been less common [15]. These approaches target single-core machines, and thus do not deal with the problem of exploiting parallelism. Relatively few papers have addressed the issue of data locality and parallelism in concert. Among these, Vadlamani and Jenks [16] present the synchronized pipelined parallelism model for producer-consumer applications. Although their model attempts to exploit locality between producer and consumer threads, they do not provide a heuristic for choosing an appropriate synchronization interval (i.e., tile size). Krishnamoorthy et al. [17] describe a strategy for automatic parallelization of stencil computations. Their work addresses both parallelism and data locality issues and is similar to the work presented in this paper. However, Krishnamoorthy et al. use static heuristics for selecting tile sizes and do not employ empirical search, as we do in this work.
2.2 Autotuning Multi-threaded Applications
Since the autotuning effort started prior to the multi-core era, much of the earlier work focused on tuning for single-core machines. Among these, several autotuned libraries for specific scientific domains have been quite successful. ATLAS, which provides highly tuned linear algebra routines, is widely used within the scientific community and has become the de facto standard for evaluating other autotuning systems [5]. Research that aims to autotune general applications falls into two categories: those that tackle the phase-ordering problem and aim to find the best sequence of transformations [2] and those that concentrate on finding the best parameter values for transformations that use numerical parameters [3, 4]. Autotuning techniques for multicore processors have also been applied to stencil computations and other numerical code within the scientific domain. Hall et al. developed the CHiLL framework that can tune numerical code on a wide range of memory transformation parameters [18]. Datta et al. propose a framework that can tune stencil code for both parallelism and locality on current multicore systems [3]. The work by Datta et al. comes closest to the work presented in this paper. The key difference between their approach and ours is that our framework explores a multi-dimensional search space, whereas their framework performs an orthogonal search, looking at one dimension at a time.
do k = 1, N
  do j = 1, N
    do i = 1, N
      A(i,j,k) = B(i,j,k) + B(i-1,j,k) + B(i+1,j,k) + B(i,j-1,k) + B(i,j+1,k)
    enddo
  enddo
enddo

Fig. 1. Example execution patterns of parallelized and blocked numerical code: (a) example numerical kernel (shown above); (b) thread execution patterns: the "unit" sweep (parallelization over i, j, k; no blocking), the "plane" sweep (parallelization over k; no blocking), and the "beam" sweep (blocking of i and j; parallelization over ii and jj).
3 Characterizing Performance Trade-Offs
In this section, we characterize performance trade-offs in parallelizing and blocking memory-intensive numerical code. For this discussion, we will consider a simple three-dimensional loop nest, as presented in Fig. 1(a). We assume the loop is fully parallelizable and there is reuse of data along all three dimensions (i.e., the only carried dependences in the loop nest are input dependences). Although these assumptions are somewhat simplistic, this example captures the core computation pattern for many scientific codes and is a suitable tool for illustrating the complex interaction between blocking and parallelizing transformations. Fig. 1(b) depicts example execution patterns for the code in Fig. 1(a). As we can see, the number of loops that are parallelized and the number of dimensions that are blocked can result in widely varying thread granularity and data access patterns for each thread. For example, parallelizing across the i, j and k loops creates extremely fine-grained parallelism, where each thread updates only one value in the array. We achieve a high degree of concurrency with this decomposition.
do t = 1, T
  do k = 1, N
    do j = 1, N
      do i = 1, N
        A(i,j,k) = B(i,j,k) + B(i-1,j,k) + B(i+1,j,k) + B(i,j-1,k) + B(i,j+1,k)
      enddo
    enddo
  enddo
enddo

Fig. 2(a). Example time-step computation kernel.
However, this variant is unlikely to perform well on most systems because the ratio of thread creation time to completed work is very high. Moreover, parallelizing the innermost loop is going to negatively impact spatial locality, which can lead to performance loss (as we discuss later in this section). This issue of extremely fine granularity can be addressed by parallelizing a subset of the loops in the nest (e.g., the "plane" sweep) and by blocking in one or more dimensions and then parallelizing the blocked loops (e.g., the "beam" sweep). However, both these methods have potential drawbacks. Parallelizing a subset of loops might imply that available parallelism on the target platform is not fully exploited. On the other hand, blocking a loop with an unfavorable block size may lead to poor locality in threads. We now discuss these trade-offs in terms of exploitable intra-core and inter-core locality. Intra-core temporal locality occurs when a data value, touched by a thread running on core p, is reused by a thread running on the same core (i.e., core p). Exploiting intra-core locality is particularly important for numerical code that sweeps over data domains multiple times (e.g., time-step computations). An example code is shown in Fig. 2(a). To achieve efficient sequential execution, this code would typically be tiled in all three spatial dimensions (with the
aid of loop skewing [19]) to exploit temporal locality across different time-steps. If we parallelize this code along the time dimension, the execution resembles the one shown on the left in Fig. 2(b), where the time dimension is broken up into four blocks. Each block is executed concurrently as a separate thread, and each thread sweeps a block of data multiple times. In this scenario, it is important to ensure that the working set of each thread is made small enough to fit in the cache. To enforce this, we can further subdivide the time blocks as shown on the right in Fig. 2(b). By selecting a sufficiently small block size, we can ensure that the working set of each individual thread fits into the cache. However, as we observe, reducing the block size also causes a reduction in thread granularity. Thus, reducing the block size for improved intra-core locality may result in an unbalanced load for the entire application and also add to thread creation overhead. The optimal blocking factor that exploits intra-core locality and finds a suitable granularity depends on a host of factors including the number of cores, the cache size and associativity, the current system load and the input data set. Thus, finding a suitable block size for intra-core locality and parallelism is best achieved through autotuning, as we demonstrate later in this paper.
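As a concrete illustration of the "beam" sweep from Fig. 1, the following C/OpenMP sketch blocks the i and j dimensions with tunable factors BI and BJ and parallelizes the blocked loops; the kernels discussed in this paper are Fortran, so this fragment, including the assumed problem size and the chosen default factors, is illustrative only and not the authors' code.

#include <omp.h>

#define N  256    /* assumed problem size */
#define BI 16     /* tunable blocking factor for i */
#define BJ 16     /* tunable blocking factor for j */

/* "Beam" sweep: each thread owns a BI-by-BJ column of the iteration space
   and sweeps it through the entire k dimension. */
void beam_sweep(int nthreads, double A[N+2][N+2][N+2], double B[N+2][N+2][N+2])
{
    #pragma omp parallel for collapse(2) num_threads(nthreads)
    for (int jj = 1; jj <= N; jj += BJ)
        for (int ii = 1; ii <= N; ii += BI)
            for (int k = 1; k <= N; k++)
                for (int j = jj; j < jj + BJ && j <= N; j++)
                    for (int i = ii; i < ii + BI && i <= N; i++)
                        A[k][j][i] = B[k][j][i] + B[k][j][i-1] + B[k][j][i+1]
                                   + B[k][j-1][i] + B[k][j+1][i];
}

Exposing BI, BJ, and nthreads rather than fixing them is what turns such a variant into a point in the combined locality/parallelism search space discussed in the following sections.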
4 Tuning Framework
Fig. 3 gives an overview of our tuning framework, which includes two key components: a transformation engine based on the POET language [8] for generating alternate code variants, and a parameterized search engine (PSEAT) for searching the transformation search space. We use HPCToolkit to probe hardware performance counters of the target platform and collect a variety of performance metrics [20]. Since we do not explicitly deal with feedback issues in autotuning in this research, the rest of this section is devoted to describing our transformation scripting language and the search engine.
4.1 Parameterization of Compiler Optimizations
We have used the POET transformation engine shown in Figure 3 to optimize the thread-level parallelism, memory locality, variable privatization, and register reuse for two SPEC95 benchmarks, mgrid and swim. The core computation of mgrid is a 27-point stencil computation on a three-dimensional space. We have focused on optimizing two performance-critical subroutines, RESID and PSINV, as they take around 58% and 23% of the mgrid execution time, respectively. Each subroutine has three perfectly-nested loops, all of which can be parallelized. In particular, none of the loops carry any dependence, but each loop carries a large number of inter-iteration reuses of the input data. We have applied three optimizations: OpenMP parallelization, loop blocking, and scalar replacement, for both subroutines, and have parameterized each optimization in the following fashion.
Fig. 3. Overview of the tuning framework: the source program and architecture specs are specialized by the Fortran or C frontend and transformed by the POET interpreter, which applies the transformation library under an initial POET script to produce a code variant; a build script compiles and executes the variant, HPCToolkit measurements are parsed as feedback, and PSEAT (a search-space interpreter offering hill climbing, simulated annealing, a genetic algorithm, and direct search) supplies the next point to evaluate.
– Parallelization. We have parameterized which loops within each loop nest to parallelize, whether nested parallelism should be considered, the thread scheduling and chunk size computed by each thread, and the number of threads that will be used to evaluate the loops.
– Loop blocking. We have parameterized which loops to block and the blocking factor of each loop.
– Scalar replacement. We have parameterized whether to apply scalar replacement for each array referenced within the loop nests.
The core computation of swim includes two subroutines, CALC1 and CALC2, each of which takes more than 30% of the overall execution time of swim. Each routine includes a sequence of three loop nests that can be selectively fused to improve register performance. Specifically, the first loop nest can be fused with either of the following loops, but not both. All loops can be parallelized. To optimize these loop nests, we have combined the application of OpenMP parallelization, loop blocking, loop fusion, loop unroll-and-jam, and scalar replacement. All transformations are parameterized in the following fashion.
– Parallelization. We have parameterized which loops to parallelize, and how many threads to use.
– Loop blocking. We have parameterized which loops to block, and the block factor for each loop.
– Loop fusion. We have parameterized which loops to fuse together.
100        # maximum number of program evaluations
3          # number of dimensions in the search space
R 1 16     # range : 1 .. 16
P 4        # permutation : sequence length 4
E 2 8 16   # enumerated : two possible values, 8 and 16

Fig. 4. Example configuration file for PSEAT
– Scalar replacement. We have parameterized to which loops to apply scalar replacement.
– Unroll and jam. We have parameterized to which loops to apply the transformation, and what unroll factor to use for each loop.
All the parameterized optimizations are implemented as source-to-source program transformations using POET [8, 21] and by invoking a POET optimization library which implements a large collection of compiler transformations such as loop fusion, loop blocking, scalar replacement, etc. First, an annotation is inserted as a comment within the original Fortran source code to identify each important sequence of loop nests to optimize. Then, a separate POET transformation script is written to apply the necessary optimizations to the interesting loop nests. A large collection of command-line parameters is used to control how to apply each optimization in different ways. POET is a language specifically designed for parameterizing optimizations of application code for auto-tuning. For more details, see [8, 22].
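The POET scripts and command-line parameters themselves are not reproduced in this paper. Purely as an illustration of what an exposed parameter set can look like in a generated variant (this is not POET output, and the kernel and macro names below are hypothetical), one can picture a C/OpenMP routine whose knobs are ordinary compile-time macros that the search engine overrides per variant:

#include <omp.h>

/* Tunable knobs; a search engine would override them per variant,
   e.g.  cc -DBLK_J=16 -DNUM_THREADS=8 ... */
#ifndef BLK_J
#define BLK_J 32          /* blocking factor for the j loop */
#endif
#ifndef NUM_THREADS
#define NUM_THREADS 4     /* number of OpenMP threads       */
#endif

/* Hypothetical stand-in for one of the optimized loop nests. */
void calc_like_kernel(int n, double **u, double **v, double **w)
{
    #pragma omp parallel for num_threads(NUM_THREADS)
    for (int jj = 1; jj < n; jj += BLK_J)            /* blocked, parallel loop */
        for (int j = jj; j < jj + BLK_J && j < n; j++)
            for (int i = 1; i < n; i++) {
                double uij = u[j][i];                /* scalar replacement of a
                                                        repeatedly read element */
                w[j][i] = uij + v[j][i];
                v[j][i] = uij - v[j][i];
            }
}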
4.2 Searching the Space of Alternate Variants Using PSEAT
Searching for alternate code variants is accomplished through the use of PSEAT, a parameterized search engine for automatic tuning. In most existing autotuning systems, the search module is tightly coupled with the transformation engine. PSEAT is designed to work as an independent search engine and provides a search API that can be used by other autotuning frameworks. This section discusses some of the design features of PSEAT and its integration into the tuning system. Input to PSEAT is a configuration file that describes the search space of optimization parameters. Fig. 4 shows an example configuration file. The syntax for describing a search space is fairly simple. Each line in the configuration file describes one search dimension. A dimension can be one of three types: range (R), permutation (P) or enumerated (E). range is used to specify numeric transformation parameters such as tile sizes and unroll factors. permutation specifies a transformation sequence and is useful when searching for the best phase-ordering. An enumerated type is a special case of the range type. It can be used to describe a dimension where only a subset of points are feasible within a given range. An example of an enumerated type is the prefetch distance in software prefetching. In addition, PSEAT supports inter-dimensional constraints for all three dimension types. For example, if the unroll factor of an inner loop needs to be smaller than the tile size of an outer loop, then this constraint is specified using a simple inequality within the configuration file.
Fig. 5. mgrid performance sensitivity to block size: (a) execution time sensitivity; (b) L1 miss rate sensitivity; (c) L2 miss rate sensitivity.
PSEAT implements a number of search strategies including genetic algorithm, direct search, window search, taboo search, simulated annealing and random search. We include random in our framework as a benchmark search strategy. A search algorithm is considered effective only if it does better than random on a given search space.
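PSEAT's implementation is not shown in the paper; the following is a minimal C sketch of the kind of feedback-driven exploration it performs over a discrete parameter space, using plain random search as the benchmark strategy mentioned above. The evaluate() hook, the dimension bounds, and the evaluation budget are all assumptions for illustration; direct search or simulated annealing would replace the point-selection loop.

#include <stdlib.h>

#define MAX_EVALS 100
#define NDIMS     3

/* Hypothetical hook: generate the variant for this parameter point
   (e.g., via the transformation engine), build and run it, and return
   the measured execution time. */
extern double evaluate(const int point[NDIMS]);

static const int lo[NDIMS] = {  4,  4, 1 };   /* lower bounds per dimension */
static const int hi[NDIMS] = { 64, 64, 8 };   /* upper bounds per dimension */

void random_search(int best[NDIMS])
{
    double best_time = 1e30;
    for (int e = 0; e < MAX_EVALS; e++) {
        int point[NDIMS];
        for (int d = 0; d < NDIMS; d++)
            point[d] = lo[d] + rand() % (hi[d] - lo[d] + 1);
        double t = evaluate(point);               /* execution-time feedback */
        if (t < best_time) {
            best_time = t;
            for (int d = 0; d < NDIMS; d++)
                best[d] = point[d];
        }
    }
}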
5 Experimental Results
5.1 Experimental Setup
Platforms: We present experimental results on three Intel-based multicore systems: a dual-core (Core2Duo, Conroe), a quad-core (Core2Quad, Kentsfield) and an eight-core machine with two quad-core processors (Xeon, Nehalem). GCC 4.3.2 with the -O2 flag was used to compile variants on each platform. Benchmarks: For this study, we look at the performance characteristics of mgrid and swim, two scientific benchmarks from the SPEC95 benchmark suite. Both mgrid and swim contain several loop nests that can be fully parallelized and exhibit a high degree of temporal locality. Experiments for both swim and mgrid used the reference data set.
5.2 Performance Impact of Blocking
We first examine the blocking search space of mgrid and swim. A blocking search space is divided into two levels: loop selection and block sizes. The loop selection level refers to which loops are selected for blocking. For example, in a two-dimensional nest, we may choose to block just the inner loop, just the outer loop, both the inner and outer loop or none of the loops. These choices can be represented with a bit string of size two and represent a search space of size 2^2. For each loop selection level, there is a multi-dimensional search space that consists of all the valid block sizes for each loop that is blocked. For mgrid, we explored four different blocking options for the loop nests appearing in the resid and psinv routines. For each blocking selection we explored a range of block sizes, starting from four and going up to 64 in steps of four. This gave rise to a two-level, 15-dimensional search space with about 15^16 points. The loop nests in swim are two-dimensional. Thus, the blocking search space of swim has fewer dimensions. However, the range within each dimension was larger, starting at 16 and going up to 512. Fig. 5 shows how the selection of blocking loops and the choice of blocking factors impact the performance of mgrid. The numbers presented are from experimental runs on Kentsfield. As we can see, there is significant variation in performance and in L1 and L2 miss rates as we vary the block sizes. This is not surprising, since scientific codes like mgrid are known to be sensitive to changes in the blocking factor. We notice that miss rates for L1 are high for smaller block sizes, and they gradually go down as we increase the block size. For L2, we observe a slightly different behavior. We notice that when multiple loops are blocked, the L2 miss rates are very low for smaller block sizes. This implies that multi-level blocking is able to exploit locality at multiple levels. However, for this to happen block sizes need to be very small (< 8). Interestingly, we observe that a reduction in the L1 or L2 miss rate does not necessarily correspond to performance gains. This means that although some blocking factors may be good for improving locality, they may have other adverse effects in terms of high loop instruction overhead and possibly reduced ILP. These results reiterate the need for tuning to find the optimal block size.
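The two-level structure described above can be made concrete with a small C sketch: a bit string enumerates which loops of a nest are blocked, and every blocked loop independently draws a block size from the 4..64 range (16 choices). The loop count and the final print are illustrative assumptions, not the authors' space-construction code.

#include <stdio.h>

int main(void)
{
    const int nloops = 3;        /* loops in one mgrid nest          */
    const int sizes  = 16;       /* block sizes 4, 8, ..., 64        */
    long total = 0;

    /* level 1: every subset of loops may be blocked (bit string) */
    for (int sel = 0; sel < (1 << nloops); sel++) {
        int blocked = __builtin_popcount(sel);   /* GCC builtin */
        /* level 2: each blocked loop independently picks a block size */
        long points = 1;
        for (int b = 0; b < blocked; b++)
            points *= sizes;
        total += points;
    }
    printf("blocking search points for one nest: %ld\n", total);
    return 0;
}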
5.3 Performance Impact of Parallelization Granularity
Similar to blocking, selecting loops for parallelization can have a significant impact on performance. We explored several different parallelization options in terms of number of loops selected for parallelization and the number of threads used for the parallel variant. Although thread scheduling and chunk size can also have a huge impact on performance, we did not explore these dimensions for this study. Among the choices for parallelizing a loop nest, we observed that nested parallelism always resulted in huge performance loss due to thread synchronization overhead. This may be due to deficiencies in the GCC OpenMP library. Nevertheless, we discard these options in our study.
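For reference, two of the variants compared below can be pictured with the following C/OpenMP sketch: P1 parallelizes the outermost loop and P2 the middle loop, with the thread count passed in as a parameter. The benchmarks themselves are Fortran, and resid_like is a hypothetical stand-in kernel, so this is purely illustrative.

#include <omp.h>

/* P1: parallelize the outermost (k) loop */
void resid_like_P1(int n, double ***a, double ***b, int nthreads)
{
    #pragma omp parallel for num_threads(nthreads)
    for (int k = 1; k < n - 1; k++)
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < n - 1; i++)
                a[k][j][i] = b[k][j][i] + b[k][j][i-1] + b[k][j][i+1];
}

/* P2: parallelize the middle (j) loop, once per k iteration */
void resid_like_P2(int n, double ***a, double ***b, int nthreads)
{
    for (int k = 1; k < n - 1; k++) {
        #pragma omp parallel for num_threads(nthreads)
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < n - 1; i++)
                a[k][j][i] = b[k][j][i] + b[k][j][i-1] + b[k][j][i+1];
    }
}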
Fig. 6. Performance sensitivity to parallelization parameters: (a) mgrid; (b) swim.
Thus, the parallelization search spaces for mgrid and swim have significantly fewer dimensions than the blocking search spaces. For mgrid, we considered parallelizing loops at each level of the three-dimensional loop nests in the resid and psinv subroutines. For each variant, we also explored implementations with two, four and eight threads. For swim, we considered parallelizing both the inner and outer loops of the initial, calc1, calc2 and calc3 routines. Fig. 6(a) shows the performance of six different parallel variants of mgrid on Conroe, Kentsfield and Nehalem. In the figure, P1TK means that the outermost loops in both resid and psinv were parallelized, and the parallel variant used K threads. P2TK means the middle loop was parallelized. We observe that, generally, parallelizing the outermost loop is most profitable. However, this is not universally true. For example, on Conroe, the variant where the middle loop is parallelized performs the best. In terms of the number of threads, as expected, setting the thread number to the number of available cores appears to work best. But again, there are exceptions. For example, on Kentsfield the best performance is obtained for P1T2, where the number of threads is 2. For swim, the performance sensitivity to different parallelization strategies is depicted in Fig. 6(b). For swim, we do not show numbers for parallelizing the inner dimension, as they always ran orders of magnitude slower than the sequential version. In Fig. 6(b), P1TK refers to a parallel variant where the outer loops of all four subroutines are parallelized, with K threads. The P2TK variant refers to the case where we only parallelize calc1 and calc2. Overall, the selection of parallel loops tends to have less of an impact on swim than on mgrid. However, even in this case, there is no single configuration that works well for all three platforms. On Conroe, P2 performs better than P1 when using four threads, whereas for the other two platforms P1 appears to be the best choice.
5.4 Non-orthogonality of Search Space
One of the main goals of this experimental study was to validate our claim that blocking and parallelization search dimensions are indeed non-orthogonal. To accomplish this, we set up an experiment where we explored the combined search space of blocking and parallelism using both orthogonal and non-orthogonal
Fig. 7. Orthogonal vs. multi-dimensional search: (a) mgrid search space; (b) swim search space.
search methods. We chose three orthogonal search methods: one where the blocking dimensions are explored first (blockfirst), one where the parallelism dimensions are explored first (parfirst), and one where blocking and parallelism dimensions are explored independently and the best value obtained from each search is used (separate). For multi-dimensional search, we selected simulated annealing (anneal) and direct search (direct), both of which are known to be effective in exploring the search space of transformation parameters [4]. To keep the comparison fair, individual dimensions in the orthogonal searches were searched using simulated annealing. We allowed each search algorithm to run for 120 iterations. We instrumented the search algorithms to output the current best value every ten iterations. Fig. 7(a) shows results of exploring the combined search space of mgrid on the quad-core platform. We observe that both direct and anneal have a clear advantage over the orthogonal search methods when the number of evaluations goes beyond 50. For a smaller number of iterations, parfirst is able to compete with the multi-dimensional strategies but in the long run does not yield the desired performance. separate performs the worst, not finding a better value over the baseline until about the 60th iteration. In fact, most of the variants picked by separate lead to worse performance. Since we only display the best value so far, the speedup appears as 1. The poor performance of separate and blockfirst, and the less than average performance of parfirst, indicate that there is indeed interaction between the blocking and parallelization dimensions
that is not captured by the orthogonal search methods. For the multi-dimensional search methods, there is no clear winner between direct and anneal. However, the overall speedup obtained by the two search methods is not that high. We speculate this could be attributed to the absence of scheduling and chunk size as search space parameters. Fig. 7(b) presents a performance comparison of the different search algorithms on the swim search space. Again, we notice that the multi-dimensional search strategies outperform the orthogonal techniques. However, in this case the performance gap is not as large. We also observe that the overall performance achieved by any search algorithm is less for swim than for mgrid. This can be attributed to the finer thread granularity in swim. Because swim contains only two-dimensional loops, the amount of work per thread is not sufficient to offset the synchronization overheads. Moreover, blocking proved to be less effective for the two-dimensional case.
6 Conclusions
In this study, we explored the search space of parallelism and data locality transformations for multi-threaded applications. We presented a method for identifying and exposing tunable parameters to a search tool. Our experimental results illustrate the non-orthogonality of the search spaces and reinforce the need for application tuning through integrated transformation heuristics.
References 1. Zhang, E.Z., Jiang, Y., Shen, X.: Does cache sharing on modern cmp matter to the performance of contemporary multithreaded programs? In: PPoPP 2010: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2010) 2. Almagor, L., Cooper, K., Grosul, A., Harvey, T., Reeves, S., Subramanian, D., Torczon, L., Waterman, T.: Finding effective compilation sequences. In: Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (2004) 3. Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J., Yelick, K.: Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC 2008 (2008) 4. Qasem, A., Kennedy, K.: Profitable loop fusion and tiling using model-driven empirical search. In: Proceedings of the 20th ACM International Conference on Supercomputing (June 2006) 5. Whaley, C., Dongarra, J.: Automatically tuned linear algebra software. In: Proceedings of SC 1998: High Performance Networking and Computing (November 1998) 6. Frigo, M.: A fast Fourier transform compiler. In: Proceedings of the SIGPLAN 1998 Conference on Programming Language Design and Implementation, Montreal, Canada (June 1998)
7. Song, F., Moore, S., Dongarra, J.: Feedback-directed thread scheduling with memory considerations. In: HPDC 2007: Proceedings of the 16th International Symposium on High Performance Distributed Computing (2007) 8. Yi, Q.: The POET language manual (2008), http://www.cs.utsa.edu/~ qingyi/POET/poet-manual.pdf 9. Allen, R., Kennedy, K.: Optimizing Compilers for Modern Architectures. Morgan Kaufmann, San Francisco (2002) 10. Thies, W., Chandrasekhar, V., Amarasinghe, S.: A practical approach to exploiting coarse-grained pipeline parallelism in c programs. In: International Symposium on Microarchitecture (2007) 11. Papadopoulos, K., Stavrou, K., Trancoso, P.: Helpercore db: Exploiting multicore technology for databases. In: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques (2007) 12. Wolf, M.E., Lam, M.: A data locality optimizing algorithm. In: Proceedings of the SIGPLAN 1991 Conference on Programming Language Design and Implementation, Toronto, Canada (June 1991) 13. Coleman, S., McKinley, K.S.: Tile size selection using cache organization. In: Proceedings of the SIGPLAN 1995 Conference on Programming Language Design and Implementation, La Jolla, CA (June 1995) 14. Ding, C., Kennedy, K.: Improving effective bandwidth through compiler enhancement of global cache reuse. In: International Parallel and Distributed Processing Symposium, San Francisco, CA (April 2001) (best Paper Award) 15. Wolf, M., Maydan, D., Chen, D.: Combining loop transformations considering caches and scheduling. In: Proceedings of the 29th Annual International Symposium on MicroArchitecture (1996) 16. Vadlamani, S.N., Jenks, S.F.: The synchronized pipelined parallelism model. In: The 16th IASTED International Conference on Parallel and Distributed Computing and Systems (2004) 17. Krishnamoorthy, S., Baskaran, M., Bondhugula, U., Ramanujam, J., Rountev, A., Sadayappan, P.: Effective automatic parallelization of stencil computations. In: PLDI 2007: Proceedings of the 2007 ACM SIGPLAN conference on Programming Language Design and Implementation (2007) 18. Hall, M., Chame, J., Chen, C., Shin, J., Rudy, G., Khan, M.M.: Loop transformation recipes for code generation and auto-tuning. In: The 22nd International Workshop on Languages and Compilers for Parallel Computing, LCPC 2009 (2009) 19. Wonnacott, D.: Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In: Proceedings of the 14th International Symposium on Parallel and Distributed Processing (IPDPS 2000), Washington, DC, USA, IEEE Computer Society, Los Alamitos (2000) 20. Adhianto, L., Banerjee, S., Fagan, M., Krentel, M., Marin, G., Mellor-Crummey, J., Tallent, N.R.: Hpctoolkit: tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience (2009) 21. Yi, Q., Seymour, K., You, H., Vuduc, R., Quinlan, D.: Poet: Parameterized optimizations for empirical tuning. In: Workshop on Performance Optimization for High-Level Languages and Libraries (March 2007) 22. Yi, Q., Whaley, C.: Automated transformation for performance-critical kernels. In: ACM SIGPLAN Symposium on Library-Centric Software Design, Montreal, Canada (October 2007)
LU Decomposition on Cell Broadband Engine: An Empirical Study to Exploit Heterogeneous Chip Multiprocessors

Feng Mao and Xipeng Shen
Computer Science Department, The College of William and Mary, Williamsburg, VA, USA 23185

This work was done while Feng Mao was associated with the College of William and Mary.
Abstract. To meet the needs of high performance computing, the Cell Broadband Engine owns many features that differ from traditional processors, such as the large number of synergistic processor elements, large register files, and the ability to hide main-storage latency with concurrent computation and DMA transfers. The exploitation of those features requires the programmer to carefully tailor programs and simultaneously deal with various performance factors, including locality, load balance, communication overhead, and multi-level parallelism. These factors, unfortunately, are dependent on each other; an optimization that enhances one factor may degrade another. This paper presents our experience in optimizing LU decomposition, one of the commonly used algebra kernels in scientific computing, on the Cell Broadband Engine. The optimizations exploit task-level, data-level, and communication-level parallelism. We study the effects of different task distribution strategies, prefetch, and software cache, and explore the tradeoff among different performance factors, stressing the interactions between different optimizations. This work offers some insights into the optimizations on heterogeneous multi-core processors, including the selection of programming models, considerations in task distribution, and the holistic perspective required in optimizations. Keywords: Software cache, Heterogeneous architecture, LU decomposition, Cell Broadband Engine.
1 Introduction
Multi-core and heterogeneous designs have been the recent trends in computer development. A typical example is the IBM Cell Broadband Engine (Cell B/E) [11], an asymmetric and heterogeneous multi-core architecture. It typically consists of one general-purpose IBM PowerPC processor element (PPE) and eight independent synergistic processor elements (SPEs). The SPEs have large register files and a good ability to hide main-memory latency with concurrent computation and direct memory access (DMA) transfers. These features make this heterogeneous
architecture suitable for accelerating computation-intensive applications, such as gaming, multimedia, and scientific applications. The matching between software and hardware on such an architecture is more important and also more challenging than on traditional homogeneous systems, mainly because the architecture is more complex and more flexible in control. For instance, the SPEs have no cache but local storages, whose management has to be explicit through schemes like DMA transfers. Therefore, a suboptimal matching may easily cause factors of performance degradation compared with the optimal matching. On the other hand, finding a good matching requires the consideration of multiple factors at the same time, in particular, how an optimization affects data locality, load balance, communication cost, and multi-level parallelism. It is typical that those factors are inter-dependent, sometimes even causing optimization conflicts. A good understanding of those performance factors and their interactions is important for effective use of such processors. This work concentrates on the exploration of the different performance factors and their interactions on LU decomposition. We choose LU decomposition as the focus because it is a fundamental kernel in many scientific applications, such as linear algebra and signal processing programs. The insights obtained from this work may directly benefit those applications. In addition, the computation of LU decomposition includes many data dependences, posing interesting challenges to the exploitation of parallelism, communication, and other optimizations. Although many studies have analyzed LU decomposition on traditional homogeneous systems, we are not aware of any systematic exploration of the optimizations of the problem on the Cell B/E architecture. (Existing implementations, such as [9], show no systematic explorations of the optimization space.) This work emphasizes the interactions between the different performance factors in the optimization, distinguishing it from many other case studies on Cell programming. More specifically, this work makes the following contributions:
– Based on the Cell SDK, we develop an extensible framework for flexibly experimenting with different optimization components for LU decomposition on Cell B/E. The framework allows plugins of a set of optimization components, and reports various performance metrics, including numbers of DMA operations, software cache hit rates, branch hint hit rates, numbers of clock cycles and so forth.
– We exploit different levels of parallelism supported by Cell. Our implementation exposes parallel tasks through a status matrix, leverages data-level parallelism by manual vectorization, and enables parallel communications by the use of non-blocking mailboxes and DMA.
– We explore the effects of a spectrum of locality optimization techniques and four kinds of task distribution schemes for block LU decomposition. We concentrate on prefetching and software cache management to enhance the effective bandwidth and hide memory access latency. We adopt an SPE-centric computation acceleration programming model, and construct three static
task distribution models and a dynamic distribution model to explore the tradeoff between locality and load balance.
– We conduct detailed analysis of the influence of the different techniques on matching LU decomposition with the Cell B/E architecture. The analysis reports the influence of each optimization on individual performance factors, such as locality, load balance, communication overhead, and parallelism. More importantly, it reveals the interactions of those factors and produces insights into the holistic consideration of optimizations for heterogeneous multicore processors. For instance, the experiments show that although task distribution affects both locality and load balance, load balance should be the only consideration when a good prefetching scheme is included in block LU decomposition.
The rest of the paper is organized as follows. In Section 2, we introduce the background on the Cell B/E architecture and the block LU decomposition algorithm. In Section 3, we describe our implementation and optimizations of the algorithm on Cell B/E. Section 4 reports and analyzes the influence of the optimizations and their interactions. Section 5 discusses related work, followed by a short summary.
2 Background
2.1 Cell B/E Architecture
The Cell B/E is a heterogeneous architecture, designed for accelerating computationally intensive applications [7]. A Cell processor is a single-chip multi-core processor, including 1 PPE and 8 SPEs operating on a shared, coherent memory. Figure 1 shows an overview of the architecture. The PPE is the main processor. It contains a 64-bit PowerPC Architecture core with a multimedia extension unit to support vector operations. Typically it runs the operating system, manages system resources, and controls the allocation and management of SPE threads. The 8 SPEs are processors designed for single instruction multiple data (SIMD) computation. Each contains a 256-KB local store controllable by software and a large (128-bit, 128-entry) register file. Each SPE relies on asynchronous DMA for data and instruction transfer to and from the main memory. It supports a special SIMD instruction set and is optimized for data-rich operations. The PPE and SPEs are connected by the element interconnect bus. Each SPE has a memory flow controller (MFC) to communicate with main memory. It is the application's responsibility to maintain coherence between main memory and the distributed local stores. User programs explicitly issue DMA commands to exchange data between the local store and memory. The user program code running on the SPEs is typically implemented as a group of threads.
2.2 Block LU Decomposition Algorithm
LU decomposition transforms a matrix A into a product of a lower triangular matrix L and an upper triangular matrix U, expressed as A = L ∗ U. (Pivoting
Fig. 1. Cell Broadband Engine architecture: one PPE and eight SPEs, each with its own local store (LS), connected by a bus with DMA to the main memory.
is not considered in this work.) Due to the importance of LU decomposition in numerical computing, many studies have explored the problem on the aspects of both algorithms and implementations. Block LU decomposition is a typical parallel algorithm that solves this problem in a divide-and-conquer strategy. Assume that we have a matrix A, expressed as a composition of 4 sub-matrices:

A = \begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix} = \begin{pmatrix} L_{00} & 0 \\ L_{10} & L_{11} \end{pmatrix} \begin{pmatrix} U_{00} & U_{01} \\ 0 & U_{11} \end{pmatrix}    (1)

The LU decomposition results can be derived as follows:

\begin{cases} L_{00} U_{00} = A_{00} \\ L_{10} U_{00} = A_{10} \\ L_{00} U_{01} = A_{01} \\ L_{10} U_{01} + L_{11} U_{11} = A_{11} \end{cases} \longrightarrow \begin{cases} L_{00} U_{00} = A_{00} \\ L_{10} = A_{10} / U_{00} \\ U_{01} = L_{00} \backslash A_{01} \\ L_{11} U_{11} = A_{11} - L_{10} U_{01} \end{cases}    (2)
L00 and U00 are respectively a lower and an upper triangular matrix. Because A00 is usually small, L00 and U00 can be easily obtained through Gaussian elimination. The sub-matrices L10 and U01 can be computed subsequently. Notice that the final equation in (2) is another LU decomposition problem, with a smaller problem size. It can be reduced to an even smaller LU decomposition problem in the same manner as above. When the problem is reduced to a matrix with only 1 block, Gaussian elimination will produce the final result. This iterative strategy is the core of the block LU decomposition algorithm. We refer the reader to [4] for more details.
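A minimal C sketch of this iterative strategy over an nb-by-nb grid of blocks is given below, following Equation (2) and the paper's no-pivoting assumption. The per-block kernels lu_block, trsm_lower, trsm_upper and gemm_sub (Gaussian elimination on the diagonal block, the two triangular solves, and the trailing update) are assumed helpers and are only declared; this is not the authors' implementation.

/* Assumed per-block kernels (declarations only). */
void lu_block(double *blk, int bs);                                   /* A00 -> L00, U00 */
void trsm_upper(const double *diag, double *blk, int bs);             /* L10 = A10 / U00 */
void trsm_lower(const double *diag, double *blk, int bs);             /* U01 = L00 \ A01 */
void gemm_sub(const double *a, const double *b, double *c, int bs);   /* c -= a * b      */

/* One iteration k of block LU; A[i][j] points to block (i, j), BS is the block size. */
void block_lu_step(double **A[], int nb, int k, int BS)
{
    lu_block(A[k][k], BS);
    for (int i = k + 1; i < nb; i++)
        trsm_upper(A[k][k], A[i][k], BS);          /* column panel */
    for (int j = k + 1; j < nb; j++)
        trsm_lower(A[k][k], A[k][j], BS);          /* row panel    */
    for (int i = k + 1; i < nb; i++)               /* trailing update of A11 */
        for (int j = k + 1; j < nb; j++)
            gemm_sub(A[i][k], A[k][j], A[i][j], BS);
}

/* The full decomposition iterates over the shrinking trailing matrix. */
void block_lu(double **A[], int nb, int BS)
{
    for (int k = 0; k < nb; k++)
        block_lu_step(A, nb, k, BS);
}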
3 Implementation and Optimizations
The Cell B/E supports parallelism at different levels, such as task-level, data level, and communication level. In the first part of this section, we concentrate on the exploitation of various parallelisms in block LU decomposition, with the focus on task-level parallelism. We discuss the task-level dependences in block LU decomposition, and describe the use of a status matrix to help dependence analysis and expose parallel tasks. In the second part, we describe the programming model used in our implementation. Our focus in this part is on the different
strategies for distributing tasks to SPEs. The distribution strategies are important for load balance, locality, and communication cost. In the third part, we concentrate on the use of prefetch and software cache for locality improvement.
3.1 Dependence Analysis and Parallelism Exploitation
In a typical implementation of block LU decomposition, a matrix is partitioned into many small blocks. At the first iteration of the decomposition process, the whole matrix is treated as 4 regions, as represented by different depths of grey (or colors) in the left bottom graph in Figure 2. The first iteration computes the final LU decomposition results corresponding to regions A00, A01, and A10 using the top 3 equations in Equation 2. It also computes the right hand side of the final equation in Equation 2 as the intermediate results corresponding to the region A11. The next iteration conducts the same computation but only on the updated region A11. As more iterations are executed, the working set becomes smaller and smaller until the final result of the last block is attained. The bottom row of graphs in Figure 2 illustrates the whole iterative process.
Fig. 2. Iterative computation in the block LU algorithm (iterations 0, 1, 2, ..., n). The top graph shows the dependences in the computation, represented by arrows; in that graph, the matrix is represented by a single column. The bottom graph shows the part of the matrix (regions A00, A01, A10, A11) that is manipulated in each iteration.
The top row of graphs in Figure 2 shows the dependences in the computation. Consider the computation in iteration 0. The final results corresponding to region A00 depend on the current values of region A00. And the computed results, along with the current values of a block in region A01, determine the computation corresponding to that block in A01; the blocks in the region A10 have similar dependences. The computation corresponding to a block (i, j) in region A11 depends on the current values of block (i, j) and the results corresponding to blocks (0, j) and (i, 0).
We use a status matrix to help runtime dependence checking. Each block in the data matrix has one corresponding element in the status matrix, indicating the number of the iteration in which the computation corresponding to the block has just finished. During runtime, the PPE will use the status matrix to determine the tasks that are ready to run. For example, in the second iteration (i.e., iteration 1), if the PPE finds that the statuses corresponding to blocks (1, 3) and (2, 1) are both 1 (the row and column numbers are 0-based), it will immediately know that the computation corresponding to block (2, 3) is ready to run. This scheme exposes all task-level parallelism. Besides task-level parallelism, we also exploit the other levels of parallelism supported by Cell B/E. Cell B/E supports data-level parallelism mainly through vectorization. In our implementation, we manually vectorized the program to maximize the data-level parallelism. We unroll the loops and insert branch hints to increase instruction-level parallelism (details in Section 3.4). We exploit communication parallelism by using the mailbox as the main communication scheme between the SPEs and the PPE. Communication through the mailbox is non-blocking. When a task is assigned to an SPE, it is given a unique DMA tag, and then all the DMA Get requests that the task needs are enqueued into the MFC. While the MFC is executing the DMA operations, the SPE is free to do computation on other requests. The status of the pending DMA operations under each tag is polled regularly. When the MFC indicates that some DMA tag has no more pending DMA operations, the corresponding task is marked "processable"; it will be processed when the processor becomes free. A similar non-blocking scheme is used for the store of computation results to the main memory.
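The readiness test implied by the status matrix can be sketched in C as below. The paper describes the encoding only by example, so the exact condition (a block in the trailing region is ready for iteration k once the two panel blocks it depends on have finished iteration k and the block itself has been updated through iteration k - 1) is an interpretation, not the authors' code.

/* status[i][j]: index of the iteration whose computation for block (i, j)
   has just finished; assumed initialized to -1 before iteration 0.        */
int ready_for_update(int **status, int i, int j, int k)
{
    return status[k][j] == k        /* row-panel block (k, j) finished iteration k    */
        && status[i][k] == k        /* column-panel block (i, k) finished iteration k */
        && status[i][j] == k - 1;   /* block (i, j) updated through iteration k - 1   */
}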
3.2 Programming Model and Task Distribution
As a heterogeneous architecture, Cell permits two kinds of programming models: the PPE-centric and the SPE-centric. In the PPE-centric model, the PPE runs the main application, and off-loads individual tasks to the SPEs. The PPE waits for, and coordinates, the results returned by the SPEs. This model has some variants like the multistage pipeline model, the parallel stage model, and the services model [8]. In the SPE-centric model, most of the application code is distributed among the SPEs. The PPE is a centralized resource manager for the SPEs. Each SPE fetches its next work item from main storage when it completes its current work. In this work, we choose the SPE-centric model because it fits the properties of the problem and may expose the maximum amount of task-level parallelism. Figure 3 depicts our programming model. The instructions for the computation of a matrix block reside in the local store of every SPE. The SPEs conduct all the decomposition computation, and the PPE's job is to maintain the status matrix, find ready tasks and distribute them to the SPEs. There are two queues associated with each SPE: the ready queue and the done queue. The PPE puts the tasks that are ready to run into the ready queues for the SPEs to dequeue and run. The SPEs put completed tasks and the results
Fig. 3. Programming model for block LU decomposition on Cell B/E: the PPE checks and updates the status matrix, runs the task distributor(s), and holds the static task lists (used only in static task distributions); each SPE (SPE 0 through SPE 7) has an available task queue and a finished task queue, communicates with the PPE through mailboxes, and performs block operations fed by its DMA queue.
into the done queues for the PPE to dequeue and commit the changes to the status matrix and the output matrix. The communication between the PPE and SPEs is through mailboxes, a scheme provided by Cell for exchanging 32-bit messages. Both parties use non-blocking operations (mailbox status checks) to avoid unnecessary data stalls. Task Distribution. The scheme of task distribution determines which SPE will get the next ready task. It critically affects the load balance, locality, communication cost, and task-level parallelism. In this work, we implement 4 different distribution schemes. The first is a balance-driven dynamic distribution scheme. When a task becomes ready, the PPE puts it into the ready queue that contains the fewest tasks. Although this scheme may produce good load balance among SPEs, it is locality oblivious, considering no data reuse across tasks.
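A small C sketch of that dynamic scheme is shown below; the queue representation and the packed task identifier are assumptions made for the illustration.

#define NUM_SPES 8

typedef struct { int tasks[64]; int len; } task_queue_t;

/* Dynamic, balance-driven dispatch: enqueue a newly ready task (block row,
   column) on the SPE whose ready queue currently holds the fewest tasks. */
void dispatch_dynamic(task_queue_t ready[NUM_SPES], int row, int col)
{
    int target = 0;
    for (int s = 1; s < NUM_SPES; s++)
        if (ready[s].len < ready[target].len)
            target = s;
    ready[target].tasks[ready[target].len++] = (row << 16) | col;  /* packed task id */
}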
Fig. 4. Illustration of 3 static task distribution schemes on 4 processors: (a) 1D-unbalanced; (b) 1D-interleaving; (c) 2D-interleaving.
The other 3 distribution schemes are static, with different tradeoffs between locality and load balance. Each of the 3 schemes determines a static mapping from tasks to SPEs. All of the mappings are embodied by a partition of the matrix blocks. The first static scheme evenly partitions the matrix into 8 sub-matrices as illustrated in Figure 4 (a) (the figure uses 4 processors for illustration); the second is a 1-dimension interleaving partition as shown in Figure 4 (b); the third
is a 2-dimension interleaving partition shown in Figure 4 (c). Each SPE executes only the tasks corresponding to those blocks that are mapped to it. The first static partition has the worst balance: SPE 0 has much less work to do than the other SPEs, because the blocks in the right and bottom regions have more computations than the blocks in the left and top regions, due to the iterative computation shown in Figure 2. The second static scheme has much better balance, and the third one has the best. On the other hand, the first and second schemes have better locality than the third one, because an SPE in the two schemes is in charge of some whole columns and thus has more data reuse than in the third scheme. Better locality also suggests the need for fewer communications. Section 4 reports the quantitative measurement of the effects of these distribution schemes. The dynamic distribution uses only the runtime status matrix to discover and distribute ready tasks, whereas the static distribution schemes use both the runtime status matrix and the static task mapping for task distribution. In our implementation of the static schemes, the PPE creates 8 threads, each of which dedicates itself to the task distribution for one SPE.
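The block-to-SPE mappings of Fig. 4 can be written as small owner functions; the exact formulas (and the 4 x 2 processor grid used for the 2D case) are not given in the paper, so the following C sketch for 8 SPEs on an nb x nb block grid is one natural reading of the figure.

#define NUM_SPES 8

/* (a) 1D-unbalanced: contiguous groups of block columns per SPE           */
int owner_1d_unbalanced(int col, int nb)
{
    int cols_per_spe = (nb + NUM_SPES - 1) / NUM_SPES;
    return col / cols_per_spe;
}

/* (b) 1D-interleaving: block columns dealt out round-robin                */
int owner_1d_interleave(int col)
{
    return col % NUM_SPES;
}

/* (c) 2D-interleaving: interleave rows and columns on an assumed 4 x 2 grid */
int owner_2d_interleave(int row, int col)
{
    return (row % 4) * 2 + (col % 2);
}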
Locality Optimizations
For locality optimizations, we concentrate on the use of prefetch and software cache. Prefetch hides the latency in data transfer, and software cache reduces the required data fetches. Prefetch. Prefetch is an effective mechanism to hide the latency in data transfer if the data accesses are predictable. In our programming models, no matter with static or dynamic task distributions, a SPE can easily predict what data it is about to use by checking the tasks in its ready queue. The prefetch in our implementation works in this way. When a SPE is about to process a task, it checks its ready queue and issues prefetch instructions for the data that the newly entered tasks may need (and not in the local store if software cache is used.) As prefetch is non-blocking, the SPE then can immediately start processing the next task in the ready queue. The capacity of the mailbox in Cell allows at most 4 tasks to be handled concurrently by the processors. So, we set the length of a ready queue to be 8. The implementation uses double-buffering to overlap computation and communication. As shown in Figure 5, the pointer of the target location for prefetch moves continuously in the local store to avoid the conflicts between the data that are used or to be used soon in the computation and the data that are newly prefetched. Software Cache. As mentioned earlier, the SPEs in Cell have no cache but only local stores. Previous work has used the local store as a software cache, such as the general-purpose software cache in the single source compiler developed by IBM [6]. In this work, we implement a software cache specific for LU
LU Decomposition on Cell Broadband Engine
69
LRU level-2 cache
Prefetch pointer
Computation pointer LFU level-1 cache
Fig. 5. Illustration of double-buffering for prefetch and 2-level software cache
decomposition. By tailoring itself to the data structure of the application, it is potentially more efficient. In our implementation of the software cache, we use an index array to record which blocks are in the local store. Each element in the index array is a triple: (row, column, iteration), where, (row, column) identifies the block in the input matrix, iteration is the iteration in which the last update to this block occurs. At a request for data, the index is first searched; only when not found, a DMA is issued to fetch the data from the main memory. When a block is fetched into the local store, the index array is updated immediately. We use a set of counters to record the number of times each block has been accessed since the latest time it is brought into the local store. The counters are used to implement the least frequently used (LFU) replacement policy, in which, when the cache is saturated, the new block replaces the block that has the smallest counter value among all the blocks currently in the cache. To explore the benefits from different replacement policies, we also implemented a second-level cache with least recently used (LRU) replacement policy. When the second-level cache is used, the blocks evicted from the LFU cache are put into it before being completely evicted from the local store. The cache scheme is used for load operations only. For store operations, the data are directly written to the main memory via DMA to update the status matrix as soon as possible. 3.4
Other Optimizions
The SPEs in Cell B/E contain no hardware branch predictors, but support the use of branch hints. In our implementation, we insert 104 branch hints (51 often and 53 seldom) according to our understanding of the program behavior. These hints help the program speculate on branches and keep the instruction pipeline properly filled. In addition, we manually unroll the loops to gain more instruction-level parallelism and reduce loop control overhead. As these are standard optimizations, the details are skipped in this paper.
70
4
F. Mao and X. Shen
Evaluation
This section reports the effects of different task distributions and the locality optimizations. To help understand the effects of the various optimizations, we use the IBM Full System Simulator to collect detailed runtime information. The simulator supports both functional simulation and cycle-accurate simulation of full systems, including the PPE, SPEs, MFCs, PPE caches, bus, and memory controller. It can simulate and capture many levels of operational details on instruction execution, cache and memory subsystem, interrupt subsystem, communications, and other important system functions [8]. We configure the simulator to simulate a Cell B/E with 8 SPEs, each of which has 256K load storage, and 2 PPEs. It runs on a Linux operating system. All the matrices used in the experiments are randomly generated; each element is a double precision floating-point number. 4.1
Single SPE Performance
This section measures the benefits from program code optimizations, which include vectorization, loop unrolling, and branch hints. We use a standalone mode to measure the computation to a 4-block matrix by a single SPE. All communication overhead is ignored. There are three versions of the program in our comparison: a scalar version with pure scalar operations, a simple vector version, and an optimized vector with loop unrolling and branch hints. Table 1. SPE performance in standalone mode
Table 1 shows the total numbers of clock cycles and instructions by the three versions. When the block size is small, 2 × 2, the performance of the vector code without branch hints and unrolling is even worse than that of scalar code. It is because the block is too small to benefit from the vector registers. When the block size increases, the vector code finds enough elements to vectorize. However, the loop control statements and branches limit the speedup of the vector code: The SPE has to execute them in scalar operations. The loop unrolling reduces loop controls and increases instruction-level parallelism. The branch hints help the vector code to decrease branch miss predictions and remove unnecessary control instructions. Together, they bring speedup of a factor of 2.7 to 5.2 compared to the simple vector version. The optimized vector version outperforms the scalar version by a factor of 2.4 to 19.8 as showed in Figure 6. The optimizations cause
LU Decomposition on Cell Broadband Engine
71
the code size to increase slightly, from 4K bytes in the scalar version to 4.8K in the vector version to 5.1K in the optimized vector version. The increase has negligible effects to the program performance.
25 scalar
20 Speedup
vector 15
vector+hint+unroll
10 5 0 2x2
4x4
8x8
16x16
32x32
BlockSize
Fig. 6. SPU SIMD speedup on standalone mode
Instruction/Byte
900 800
Scalar
700
Vector
600
Vector Hint & Unroll
500 400 300 200 100 0 2
4
8
16
32
Block Size
Fig. 7. The number of instructions required for processing one byte of a matrix element
Figure 7 shows the efficiency of instructions in the three versions. The vectorized code uses far fewer instructions to process one matrix element because of the SIMD scheme provided by Cell. The unrolling and branch hints enhance the efficiency further by removing many scalar operations and reducing the penalty of branch miss prediction. The following sections report the benefits from locality optimizations and different task distributions. All the experiments use 32x32 as the block size, with all code optimizations enabled. 4.2
Locality Optimizations
Software cache is effective in reducing the number of DMAs as shown in Figure 8 (a). The graph shows the required number of DMAs, normalized by the number when software cache is not used. As the 1D and 2D static task distribution have the best locality, the software cache reduces 45% DMA accesses. In contrast, the dynamic distribution has the worst locality, only showing 20% DMA reduction.
F. Mao and X. Shen 0.9
50
0.8
45
0.7
40 35
0.6
Clocks (M)
Normalized DMA #
72
0.5 0.4 0.3
30 25 20 15
0.2
10
0.1
5
0
0 Dynamic
2D
(a)
1D
1D-unbalanced
nPre+nCache
nPre+Cache
Pre+nCache
Pre+Cache
(b)
Fig. 8. Reduced DMAs due to software cache (a), and the impact to overall performance from prefetch and software cache (b). (“nPre”: no prefetch, “nCache”: no software cache.)
Figure 8 (b) shows how the reduction of DMA helps the overall performance. The matrix size is 256x256 and we use 2D task distribution. If prefetch is disabled, the software cache improves the performance by 43%. However, when prefetch is enabled, the benefits from software cache become unnoticeable. The prefetch itself is enough to hide the latency in data transfer, improving the overall performance by 137%. This result suggests one of the interactions between different optimization techniques: An effective optimization technique, such as the software cache, becomes unnecessary when some other optimizations like prefetch is used. So by default, we enable prefetch and disable software cache in the following experiments. 4.3
Task Distribution
Different task distribution strategies cause different data locality and load balance. Good locality helps to reduce the required data transfers (i.e., the number of DMA operations.) In last section, Figure 8 (a) already shows the different effectiveness of software cache on different task distributions. Figure 9 further shows the average numbers of DMA operations per task when different task distributions are used when the block size is 32x32. When single-level cache is used only, the dynamic and the 2D interleaving distributions require much larger numbers of DMA operations than the 1D distributions. When two level cache is used, the 2D interleaving distribution requires no more DMA operations than the 1D distributions, whereas, the dynamic distribution still requires significantly more DMA operations. This result is intuitive as the dynamic distribution is locality-oblivious, having the worst locality among all the distributions. On the other hand, the dynamic distribution and the 2D interleaving distribution have the best load balance. The poor load balance in the two 1D distributions causes up to orders of magnitude performance degradation compared to the performance by the other two distributions, even though they have better locality.
LU Decomposition on Cell Broadband Engine
73
Fig. 9. Average numbers of DMA operations per task
Besides load balance, the dynamic distribution has another advantage: It allows the full freedom for SPEs to explore task-level parallelism. While in the static distributions, a ready task may have to wait for a particular SPE to get free, even though some other SPEs may be idle. This advantage comes at the sacrifice of data locality. However, as shown in Section 3.3, prefetch is enough to hide data transfer latencies for LU decomposition. Therefore, the two advantages of the dynamic distribution make it a more appealing choice than the static distributions. The comparison between the two graphs in Figure 10 verfies the above projection. The figures show the performance of solving a 128 × 128 matrix for various block sizes by using the 2D-interleaving distribution and the dynamic distribution. The three bars for a block size show the range and the mean of the performance of the 8 SPEs. The dynamic distribution shows smaller difference between the minimum and maximum clocks when the block size is 16 and 32 than the static distribution. It indicates better balance. The relatively larger difference when the block size is 8 is likely due to the randomness in dynamic distribution. Overall, the dynamic distribution outperforms the static distribution by a factor of 1 to 7 in terms of the maximum clocks.
(a) 2D interleaving distribution
(b) dynamic distribution
Fig. 10. Overall running time with different task distributions
5
Related Work
As an important linear algebra kernel, LU decomposition has been studied extensively on traditional parallel computers. In 1993, Buoni et al. have studied the
74
F. Mao and X. Shen
different algorithms for static LU decomposition on traditional shared memory processors [4]. A scalable implementation is included in some widely used linear algebra packages, such as LAPACK [1] and ScaLAPACK [3]. However, the implementations assume the computing systems to be either a homogeneous parallel computer or a homogeneous network of workstations. Beaumont et al. studied the matrix partition schemes for LU decomposition on heterogeneous networks of workstations [2], rather than Cell B/E. We note that this current work is not the first work that implements LU decomposition on Cell B/E. But the existing implementations, such as the one in the Cell SDK library [9], although working well on a single SPE, have not systematically explored the design space and optimization issues for multiple SPEs. This work is not aimed to produce a universally applicable, fastest LU decomposition, but to use LU decomposition as an example problem to reveal the interactions among different optimizations on Cell B/E and obtain the insights in holistic optimizations for heterogeneous multicore architecture. Locality optimization has been a focus in many previous studies, especially on traditional CPU and modern multicores [13,10]. For LU decomposition, an example is the automatic blocking for improving its locality on SMP [12]. On Cell B/E, the IBM project of Single Source Compiler (SSC Research Compiler) [6, 5] has included interesting explorations to locality optimizations. These explorations are particularly for automatic conversion of general OpenMP programs to Cell programs; the performance of the generated programs is often not as good as that of the programs directly developed from Cell SDK.
6
Conclusions
In this paper, we present an exploration to tailor block LU decomposition to Cell Broadband Engine processors. The implementation exploits different levels of parallelism supported by Cell. It exposes parallel tasks through a status matrix, leverages instruction-level parallelism by manual vectorization, and enables parallel communications by the use of non-blocking mailboxes. We study the effects of different task distribution strategies and a set of locality optimizations. The exploration reveals the interactions between those optimizations, and offers some insights into the optimization on heterogenous multi-core processors, such as the selection of programming models, considerations in task distribution, and holistic perspective required in optimizations. Acknowledgments. We thank Dimitrios Nikolopoulos, Zhengyu Wu, and Stephen McCamant for their help. This material is based upon work supported by the National Science Foundation under Grant No. 0720499 and 0811791. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
LU Decomposition on Cell Broadband Engine
75
References 1. Angerson, E., Bai, Z., Dongarra, J., Greenbaum, A., McKenney, A., Du Croz, J., Hammarling, S., Demmel, J., Bischof, C., Sorensen, D.: LAPACK: A portable linear algebra library for high-performance computers. IEEE Supercomputing, 2–11 (1990) 2. Beaumont, O., Legrand, A., Rastello, F., Robert, Y.: Static LU decomposition on heterogeneous platforms. The International Journal of High Performance Computing Applications 15(3), 310–323 (Fall 2001) 3. Blackford, L.S., Choi, J., Cleary, A., D’Azeuedo, E., Demmel, J., Dhillon, I., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.C.: ScaLAPACK user’s guide, Society for Industrial and Applied Mathematics, Philadelphia (1997) 4. Buoni, J.J., Farrell, P.A., Ruttan, A.: Algorithms for lu decomposition on a shared memory multiprocessor. Parallel Comput. 19(8), 925–937 (1993) 5. Chen, T., Zhang, T., Sura, Z., Tallada, M.G.: Prefetching irregular references for software cache on cell. In: CGO, pp. 155–164 (2008) 6. Eichenberger, A.E., et al.: Using advanced compiler technology to exploit the performance of the cell broadband enginetm architecture. IBM Syst. J. 45(1), 59–84 (2006) 7. Pham, D., et al.: The design and implementation of a first-generation cell processor. In: Proceedings of the IEEE International Solid-State Circuits Conference, ISSCC (2005) 8. IBM. Cell be programming tutorial, http://www-01.ibm.com/chips/techlib/techlib.nsf/ products/Cell Broadband Engine 9. IBM. Cell broadband engine sdk libraries v3.0 (2008), http://www.ibm.com/developerworks/power/cell 10. Jiang, Y., Zhang, E., Tian, K., Shen, X.: Is reuse distance applicable to data locality analysis on chip multiprocessors? In: Proceedings of the International Conference on Compiler Construction (2010) 11. Kahle, J.A., Day, M.N., Hofstee, H.P., Johns, C.R., Maeurer, T.R., Shippy, D.: Introduction to the cell multiprocessor. IBM J. Res. Dev. 49(4/5), 589–604 (2005) 12. Yi, Q., Kennedy, K., You, H., Seymour, K., Dongarra, J.: Automatic blocking of qr and lu factorizations for locality. In: MSP 2004: Proceedings of the 2004 Workshop on Memory System Performance, pp. 12–22. ACM, New York (2004) 13. Zhang, E.Z., Jiang, Y., Shen, X.: Does cache sharing on modern cmp matter to the performance of contemporary multithreaded programs? In: PPoPP 2010: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 203–212 (2010)
FDTM: Block Level Data Migration Policy in Tiered Storage System Xiaonan Zhao, Zhanhuai Li, and Leijie Zeng School of Computer Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China {zhaoxn,lizhh,zenglj}@nwpu.edu.cn
Abstract. ILM and tiered storage system are designed to address the challenge of achieving balance between cost and storage performance. However, both of them are hard to implement fully automatic data migration by traditional solutions for which are mainly relying on administrators experience and need huge manual work for data migration according to storage configuration and IO access patterns. This paper proposes a novel bi-directional migration policy FDTM based on blocklevel data valuation and fully automatic migration process. FDTM aims to get a trade-off between storage QoS and migration costs by introducing double thresholds to narrow the migration scope of block-level data objects. Experiment and analysis show that FDTM is efficient at blocklevel data migration comparing with traditional migration policies. In addition, it could help pave the way to implement tiered storage system with fully automatic data migration. Keywords: Data migration policy; data valuation; feedback; Tiered Storage System.
1
Introduction
Industry research has shown that 70-80% of all storage data is inactive, and data is rarely accessed after 30-90 days[11]. So in business perspective, the data value is declining over time dramatically. SNIA proposes ILM to address this issue[18]. In addition tiered storage as “tiering ILM”, is the key component in an ILM practice, but not all. Generally, there are three types of storage in tiered storage system: online, near-line and off-line [19], in which data is classified into distinct classes and stores in different tier separately with considering performance, availability, recoverability, safety and other requirements. A tiered storage system is usually composed of expensive FC disk arrays for high performance accessing, cheaper SATA storage subsystems for data staging and high-capacity robotic tape libraries for data archiving, which could decrease the system’s TCO while keeping the storage performance at the same time. Fig. 1 shows the infrastructure of a tiered storage system. In a tiered storage management system, data classification, data placement and data migration are the core function components. Furthermore data migration is also helpful to optimize storage system performance, which includes C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 76–90, 2010. c IFIP International Federation for Information Processing 2010
FDTM: Block Level Data Migration Policy in Tiered Storage System
Applications Definition Business process
Requirement
77
Requirement Management object Policy
Data Parameters
Data Migration
Data placement Data valuation data
Execution
Data management & optimize system service
Storage devices Management
Tired storage system
Online
Nearline
Offline
Fig. 1. Infrastructure of Tiered Storage System
single-level migration and multi-level migration. The main goal of single-level migration is to balance workload, however multi-level is more common in tiered storage system in order to manage the system in the form of ILM by reducing the TCO of the system with QoS guaranteed as while. It’s clearly that data migration policy affects performance, flexibility and other properties of the whole system. Today, much researches on migration focus on load balance of distributed system or cluster system [17], and because of the obvious difference on performance, capacity and other aspects among storage devices, data migration is more complex in tiered storage system. However traditional migration policies took little consideration about robustness. Moreover, the trigger of migration today is heuristic and coarse-grained, and almost all of them pay too much attention to one way migration, to migrate from high performance storage to low performance storage, without taking enough consideration of down to up migration. Based on our study of traditional policies for automated tiered storage, this paper proposes a novel bi-directional migration policy FDTM(Feedback based Double Threshold data Migration) at block level. It is fully self-adaptive, deals with the down to up migration carefully as well as up to down migration, and could avoid unnecessary data migration under “false overload” situation by setting of double thresholds and spill over interval.
2
Data Placement by Block-Level Data Valuation
Data classification is a precondition for data migration, and migration policy is usually based on data classification and placement. In addition data valuation is
78
X. Zhao, Z. Li, and L. Zeng
the core of data classification. In this paper, FDTM considers the specification of storage system and the characteristics of data objects, to get data objects value based on block-level evaluation, which is used to evaluate the importance of the data in a tiered storage system. Obviously, it is not hard to achieve couple of metrics related to data value at file-level, because metadata contains all the necessary information. However, in storage subsystem at data center, most data are stored and managed at blocklevel, within one data block there maybe couple of files, so value metric is hard to achieve. But data blocks still have some attributes which are helpful for blocklevel data valuation: 1. READ FREQUENCY. Read times in given interval to a certain data block. 2. WRITE FREQUENCY. Write times in given interval to a certain data block. Usually, it has different performance in write and read within same storage device, and another fact is that two storage devices maybe have nearly same read performance but have huge difference in write performance. Ri and Wi represent the read/write frequency at the ith time-step. 3. R/W GRANULARITY. The granularity represents the data amount ratio which is related to an IO (whether read or write) to a fixed size data block. Pi is the average R/W granularity in the ith time-step. 4. DATA DISTRIBUTION. It is the location information of accessed data, measured by statistical result which comes from all the operations in data blocks in a given interval. Di means the data placement value at the ith time-step. 5. RELEVANCE BETWEEN BLOCKS. Similar to relevance between files [5], it means if the IO operations on a data block is similar to the IO operations on another block in a given time interval, in other words these two blocks are associated, then the data value of these blocks is also considered as being correlated. MIf there are M blocks relate to block d within the same storage device, j=1 RLdj represents block relevance of d and other M blocks. And the association coefficient RLdj which is indicated association degree between data block d and j is defined as equation 1: Ed · Ej (1) |Ed ||Ej | N N 2 in which, Ed · Ej = i=1 qdi qji , |Ed | = i=1 qdi here, T means the valuation time interval, and it is divided into N time-steps equally, each time-step’s length is ti − ti−1 = L. If t indicates the current time, then T is the time period of [t − N × L, t]. Ed (qd1 , qd2 , . . . qdN ) is an IO vector during T of data block d (Ej is IO vector of data block j as well ), it is used to record R/W times information of d during T , and component qdi is R/W times in the ith (i = 1, 2 . . . N ) time-step. as following, the data blocks valuation could be calculated by equation 2: M N 1 + j=1 RLdj Vt (d) = (λw Wi + λr Ri )Pi Di (2) N i=1 RLdj =
FDTM: Block Level Data Migration Policy in Tiered Storage System
79
Vt (d) is the value of data block d at time t. In which Ri and Wi represent the read/write frequency at the ith time-step [t−i×L, t−(i−1)×L] in the valuation interval [t−N ×L, t] respectively; Pi is the average R/W size at the ith time-step; Di is the data placement value at the ith time-step. If there are M data blocks relate to block d within the same storage device, then sum all M factors as one of valuation factor for block d. λr and λw are the compensation coefficients for read and write respectively, which are used to measure the time cost difference for the R/W operation between two kinds of storage (mainly include time cost ratio in accessing the same size of data with same times of R/W operations). For more detail about data valuation, please refer to[25].
3
Starting Conditions for Data Migration
In this section, we will discuss when migration should start. In a tiered storage system, the data migration process got to consider many preconditions in the management system for special requirements. Generally, there are four main trigger conditions need be taken into account for the applications. 1. FREE SPACE OF STORAGE DEVICES This is the spare capacity threshold of storage devices to keep the specific application running well. 2. DIFFERENCE OF DATA VALUES There are two attributes to determine the value of data. One is the characteristic of the application itself, such as requirements of security, reliability of the system, so some data are more important than others inherently sometimes, the other attribute is the data value itself which is declining as time goes by. 3. UTILIZATION RATE OF DATA It depends on the data class. Some class is declining when it is created; some class is increasing at first and declining after it reaches the peak utilization rate of itself, and utilization rate of another class changes periodically. 4. RULES DEFINED BY LAW There are some rules defined by law in data management. Such as call history of cell phone for billing would only be reserved for several months, people could forecast the trend of some data changing and so on. Moreover, these conditions are related to each other. So in order to get a more reasonable migration policy, it is necessary to consider these conditions comprehensively. At current, there are two well-used basic polices[24]: 1. FIXED THRESHOLD WITH ACCESS FREQUENCY It has both up-migration and down-migration, and setting min and max thresholds for data access frequency. Migration will be triggered when reaching specific threshold. 2. HIGH-LOW WATER LEVEL OF STORAGE Thresholds are setting according to the capacity usage ratio. When the actual capacity usage ratio exceeds the threshold, the system will migrate some data according to their access frequencies.
80
X. Zhao, Z. Li, and L. Zeng
In above policies there are couple of demerits: although it’s an up-down migration, it will become time-consuming badly when the system has huge amount of data objects. Moreover, if a system with average low workload is changing dramatically at capacity ratio (called “false overload”), it will trigger migration operations many times and maybe introduces oscillation migrations.
4
Feedback-Based Double Threshold Data Migration
Data migration policy got to address four questions that are known as when, where, what and how much, by which to determine the migration candidates, migration quantity, and migration target. Migration target is obvious in tiered storage system, either is higher performance storage or lower performance storage. So the migration candidates and the quantity are key elements. Based on the summary introduction in early paper [26], an extended discussion of this policy will be given out completely in following sections. Other 3 migration situations will be proposed, however we just discuss two-tier storage structure for better description and understanding. 4.1
Parameters Definition and Initialization
Suppose the defined capacity usage ratios are same across different storage tiers. And using letters h and l at subscript of parameters to represent high performance storage and low performance storage. If CT = Ch + Cl represents total capacity and Ch is the capacity of high storage, Cl is the capacity of low storage, then high storage’s usage ratio is HC = Ch /CT , and low storage’s usage ratio is LC = Cl /CT . Actual data size at each tier has 4 thresholds: base-high, base-low thresholds, limit-high and limit-low thresholds, they will be described in detail later. Here, β represents high threshold and α represents low threshold, superscript 0 means “base” thresholds, superscripts max and min represent “limit” thresholds. We use two queues, the up-migration queue and the down-migration queue which are represented by two vectors Qu and Qd respectively for each tier. The elements in queue vectors are metadata of each migration candidates. The information about data value will be collected periodically after the system startup, and the interval is same with data valuation but without any calculation of data value until the enqueue condition was triggered. 4.2
When to Migration
An ideal storage management system should keep online for applications during data migration, and the storage performance should not be impacted badly at any time. So, the migration timing depends on bandwidth, migration speed and migration time window. The D-value of high and low thresholds δ should
FDTM: Block Level Data Migration Policy in Tiered Storage System
81
be the function of E(t) and g(Bi ). E(t) is the function of time cost of data migration, and g(Bi ) represents the migration speed, Bi is the bandwidth for data migration. In other words, D-value δ is a function of expected time cost and system migration bandwidth: δ → f (g(Bi ), E(t)). The system could be regarded as stable, if high storage data size βh (t) at time t is not bigger than βh0 , in which βh0 = βT × HC , βT (CT > βT ) is the total data size of storage devices, and the limit-high threshold is βhmax = Kβh0
(3)
K = min(γ, 0.5 + 0.5Ch /βh0 ),
(4) HC
in equation 4, γ is the ratio of actual data capacity usage ratio and the original HC when the high storage provides same QoS with low storage. And when the ratio is much larger than 2, a ratio which can reach the right mid-point of βh0 and Ch should be used. As to the base-low threshold α0h = βh0 − δ of high performance storage, the = α0h × LC . The 4 thresholds for low limit-low threshold is defined as αmin h performance storage could be derived with similar approach with considering two-tiered storage, δ and K are defined as following:
4.3
δ = δ × Cl /Ch
(5)
K = min(γ , 0.5 + 0.5Cl /βl0 )
(6)
Migration
FDTM has up-migration and down-migration conditions for both high storage and low storage. The up-migration means data migrating from low storage to high storage, which could be passively migration triggered by the actual data size on low storage exceeds its base-high threshold βl0 , or be proactive migration triggered by the actual data size on high storage lower than its base-low threshold α0h . In reverse, the down-migration is data migrating from high to low performance storage, which could be triggered by the actual data size on high storage exceeds βh0 , or be triggered by the actual size on lower than α0l in low storage. Storage device specification determines the timing of enqueue and migration, to answer the “when” question, and the characteristics of data objects determine data valuation. If the base thresholds were triggered, then each storage tier starts to enqueue migration candidates into Qu and Qd based on data blocks’ value, which answers the “what” and “how much” questions. Adjusting the entry queue conditions by each migration operation, can improve the queue length (or the migration quantity), enhance the migration policy efficiency, and keep the stability of the tiered storage.
82
X. Zhao, Z. Li, and L. Zeng
A1
β h0
A
t1 t2 t0
B
β hm a x
C
β h0
B
B
β hm a x
β h0
A
t m tn
t0
t
tm
Case1:timing from t0 to tm after from t1 to t2, migrate
t0
t
Case3: timing from t0 to tn , migrate
A
tm
t
tn
Case5: timing from t0 to tn , migrate
β
β
β hm ax
β h0
β
β
β
β hm a x
A
t0
tm
B
β hmax
C
β h0
t
β
B
A
t0
B
βhmax
C
tm
tn
Case2:timing from t0 to tm , not Case4:timing from t0 to tn , not migrate, stop monitoring migrate,go on monitoring
βh0
t
A
t0
C
tm
tn
t
Case6: timing from t0 to tn,not migrate if area less than S△ABC
Fig. 2. Condition cases for trigger migration. The area formed by three points A,B,C and whether the special case need to be migrated according to the conditions discussed in algorithm 2.
Situation 1: Down-migration for Exceeds βh0 . There are several input parameters used in the algorithm 2. L, checkpoint interval during monitoring; t, checkpoint; βh (t), actual data size d of high storage at time t; Adh , data amount in down queue of high storage; Vlast , d u data value of the last data object in Qh ; Vmin , data value of the min data t object in up queue of low storage; C1 , represents expression t0n (βh (t) − βh0 )dt ≥ 1 (t − t0 )(βhmax − βh0 ); Dmig , data amount has been migrated till current time. 2 n In fact, the determined conditions of FDTM are based on the fitting of a linear function to the trend of data amount in the tiered storage. When the growth trend of the curve is less than the slope of the linear function, also known as the curve integral is less than the area of the triangle (SABC ) in the determined condition (the area composed by the slash and dash between two time points and y = βh0 , as shown in case 4 of the Fig.2.), we regard the system as stable and doesn’t need migration. Or else, the data amount growth could be considered as linear growth and the data migration should be implemented at once to reduce the data amount in high storage. It is the passive down-migration in high storage. Situation 2: Up-migration Exceeds βl0 . Situation 2 and situation 1 are similar, but the length of up queue is limited by δ instead of δ which is related to low storage, for the usage stability of high storage is the precondition of up-migration, data amount of migration is determined by high storage’s capacity.
FDTM: Block Level Data Migration Policy in Tiered Storage System
83
Algorithm 1. ImpMigration(Qdh ) 1: 2: 3: 4: 5:
while (Dmig < δ)and(βl (t) < βl0 ) do migration data objects in Qdh end while d record last data value migrated by Vmax ; d return Vmax ;
At time t0 and βl (t0 ) = βl0 , start timing and enqueue data objects into Qul according to their value descending, the length of Qdh is determined by the amount of data objects in it, it should be less than δ or the value of last data object in d . Continue to monitor the system at interval L, and do steps Qdh less than Vmax similar to situation 1, the differences are as following: All the conditions which related to the high storage in situation 1 (such as βh0 , βhmax , βh (t)) will be replaced by the corresponding ones of the low storage (βl0 , βlmax , βl (t)) here, and vice versa; The inequation used to judge whether the tiered system needs to migrate is changed at the right side: give a value of LC as an coefficient when calculate the area of triangle, because with the same data changing rate, the performance of low storage is lower; u , Furthermore, the min data value in up-migration should be recorded as Vmin which is used for a new migration cycle after migration. Situation 3: Up-migration Below α0h . Moreover, to make a complete migration policy to achieve higher storage efficiency it should implement migration when the actual data size of high storage or low storage is less than base-low threshold, in situation 3 and situation 4, we’ll discuss it in detail. At time t0 , when high storage capacity usage size declines to less than α0h , start timing and enqueue block-level data objects into the queue Qul in low storage, data amount in Qul should not exceed δ or include all the active data objects in low storage, and do as following steps: , goto 1. If it raises above α0h again before reaching the limit-low threshold αmin h step 3, else continue timing until the actual data size drops to αmin , then h record the current timing as tm , and goto step 2; 2. Implement up-migration until all the data objects in Qul are migrated, and goto step 3; 3. Stop timing and enqueuing monitoring. Waiting for next condition trigger αh (t) = α0h , then goto step 1 for next cycle. Situation 4: Down-migration Below α0l . At time t0 , the data amount in low storage declines to the base-low threshold α0l and at the same time the data in high storage exceeds base-high threshold βh0 , then start timing and enqueuing block-level data objects into the queue Qdh in
84
X. Zhao, Z. Li, and L. Zeng
Algorithm 2. Down Migration for exceeds βh0 1: while 1 do 2: keep monitoring the tiered storage system; 3: if βh (t) = βh0 then 4: t0 ⇐ t; start to enqueue Qdh by block-level data value, length limited by Adh < δ d u or Vlast < Vmin 5: end if 6: while βh0 ≤ βh (t) do 7: if βh (t) = βhmax then 8: tm ⇐ t; t 9: if t0m (βh (t) − βh0 )dt ≥ 12 (tm − t0 )(βhmax − βh0 ) then 10: ImpMigration(Qdh ); 11: break; 12: else 13: t = t + L; 14: if βh (t) > βhmax then 15: ImpMigration(Qdh ); 16: break; 17: else 18: for t; t − tm ≤ tm − t0 ; t + L do 19: tn ⇐ t 20: if (βh (t) ≥ βhmax )or((tn − tm = tm − t0 orβh (t) ≤ βh0 )and t ( t0n (βh (t) − βh0 )dt ≥ 12 (tn − t0 )(βhmax − βh0 ))) then 21: ImpMigration(Qdh ); 22: break; 23: end if 24: end for 25: end if 26: end if 27: end if 28: end while 29: stop timing and enqueuing; 30: end while
high storage, data amount in Qdh should not exceed (βh (t) − βh0 ) · LC or δ (take the smaller one from them), and do as following steps: , goto 1. If it raises above α0l again before reaches the limit-low threshold αmin l step 3, else continue timing until the data amount drops to αmin , then record l the current timing as tm , and goto step 2; 2. Implement down-migration until all the data objects in queue Qdh are finished migrating or the actual data size of low storage reaches βT · LC , and goto step 3; 3. Stop timing and enqueuing monitoring. Waiting for next condition triggers αl (t) = α0l , then goto step 1 for next cycle.
FDTM: Block Level Data Migration Policy in Tiered Storage System
85
It’s obviously that situation 1 and 2 would happen in higher probability than situation 3 and 4, if there are a reasonable data placement method in the tiered storage system which is out of discussion in this paper. 4.4
Discussion on Better Migration Effect
FDTM introduces feedback mechanism to improve its thresholds parameters for the parameters used at beginning maybe not very appropriate. So the parameters of migration policy could be more reasonable and the application could get more performance benefit with time going by. d which indicates the length of down-migration For example, the value of Vmax d queue Qh of high storage, could be adjusted by its history values. Here, a simple method is introduced: Fi , (i = 1 . . . n, n is the migration times) represents the last block-level data object’s value in the ith migration queue Qdh , after the first d d 2 migration Vmax = F1 , after the second migration, V should be F1 +F . . . and 2 max n
F
i d after the nth migration, the adjusting is Vmax = i=1 , which is the average n value of Fi in n times of migrations. Certainly, with considering the different importance of its history value, we can get some different adjusting method, such as equation 7
d (n + 1) = Vmax
n i=1
d θi Vmax (i),
n ( θi = 1)
(7)
i=1
d . here, θ represents the importance of history values of Vmax u At the same time, if Vmin is up-migration enqueue condition, after the nth time of migration, it should be adjusted by all values of the last data objects in the queues at each migration. In order to make FDTM more robust and efficient, the other parameters or thresholds could be adjusted according to real migration u monitoring and feedback, such as Vmin , α, β, δ and etc.
4.5
Experiment
In this section, we validate and analyze the FDTM policy with DiskSim simulator [3] for we dont have a real tiered storage environment. In addition we choose Seagate Cheetah 15.5K FC disk and SSD disk to compose a two tiered storage system within DiskSim. The Seagate Cheetah 15.5K FC disk specification is extracted by DIXtrac disk characterization tool, and in other side the SSD DiskSim extension by Microsoft research is an idealized SSD that is parameterized by the properties of NAND flash chips instead of a simulator for any specific SSD. We show the device capabilities difference by dollar cost in Fig. 3 according to the latest research on the tradeoffs between FC disk and SSD disk [13] for better understanding of the storage devices in the experiment environment. It’s obviously that SSD disk could provide better IO performance, which could be around 100 times [13], than FC disks under same workload but cost more dollars. Obviously in enterprise data center it’s very worthy to adopt tiered storage
86
X. Zhao, Z. Li, and L. Zeng Table 1. Storage Device Characters
Tier 1 storage
Tier 2 storage
SSD Disk SSD extension for DiskSim by Microsoft research Size 32 GB Block Size 256 KB Block Number 131072 The SSD is composed by 8 elements, each el ement has 8 planes, and each plane has 2048 blocks FC Disk Seagate Cheetach 15.5K FC Disk Size 146.8 GB Block Size 512 B Block Number 28679487
,236
*%
0HPRULJKW05 66' 6HDJDWH&KHHWDK
0%V Fig. 3. Device capabilities normalized by dollar cost
infrastructure by FC disks and SSD disks, to provide improved IO performance with reasonable storage cost. Tab. 1 is the basic configuration for this simulated environment. The workload IO trace for experiment is randomly generated by using DiskSim at start, moreover, in order to play the same workload trace on four kind of storage systems (the SSD disk, the FC disk and the tiered storage system) respectively, so the block access range in this trace is limited by the size of the SSD disk, which is the smallest in capacity. Besides, we have two assumptions at here: Assumption 1: We assume that the initial data placement has little performance impact on experiment results, so we randomly place the data block objects across the tiered storage system to avoid the initial data placement issue. Assumption 2: We assume that the overhead of data block objects migration between SSD and FC disks could be ignored for the experiment, for it’s hard to simulate the overhead of data migration in this simulated environment and we’ll pay more attention on policy itself. So we only change the IO request target devices between SSD and FC disks in the workload trace to simulate the environment with data migration, in fact the data migration doesn’t happen in the experiment.
Average Response Time (ms)
FDTM: Block Level Data Migration Policy in Tiered Storage System
9 8 7 6 5 4 3 2 1 0
87
SSD FC SSD+FC TS-with-FDTM
Read-ART
Write-ART
IO-ART
Fig. 4. Average response time of 4 kinds of storage system
70000
SSD FC SSD+FC TS-with-FDTM
I/O Request
60000 50000 40000 30000 20000 10000 0 <
5
< 10
< 20
< 40
< 60
< 90
<120
<150
<200
200+
I/O Response Time (ms)
Fig. 5. Overall IO system response time distribution of the experiment
In experiment the workload trace is played 4 rounds on the 4 kinds of simulated storage devices, which are SSD disk, FC disk, tiered storage composed by SSD and FC disks, and tiered storage composed by SSD and FC disks with the FDTM policy control. The experiment results are collected from the DiskSim outputs. Fig. 4 shows the average response time results of the experiment. We also give out overall IO system response time distribution in Fig. 5 It could be concluded from the experiment results that the tiered storage system with FDTM policy could provide better IO performance and it’s easy to be implemented in current storage system. However, we just valuated FDTM with a simulated environment and synthetic workload trace, in which we take little consideration on the overhead of FDTM, so this experiment is just used to validate FDTM policy and we aim to valuate it in real environment with ongoing work.
88
5
X. Zhao, Z. Li, and L. Zeng
Related Work
As one of the most important concepts, DMF(Data Management Forum) in SNIA promotes ILM as a special topic in the field of data management system research and application. In [14,20], an deep analysis and discussion was done to the relationship between ILM and tiered storage system. There are several systems providing tiered storage system management functions transparently, such as SGI InfiniteStorage Data Migration Facility [19], VERITAS’ NetBackup Storage Migrator [1] and IBM’s HPSS[23]. Additionally, early research mainly focuses on the management and employment of tertiary storage, an offline storage tier in the tiered storage system, such as the applications in VoD(Video on Demand) system and DBMS[12,15], researches [6,16] mainly focus on the problem of performance in tertiary storage. With the development of tiered storage applications, studies on data classification and data placement are more popular, in [21] the necessity and importance in data classification and its basis are discussed, and [4,18] give out some specific methods for classification. References [8,9] focused on data placement under different storage infrastructure. There are also a lot of researches on data migration, such as to implement online migration by leveraging control theory [10] to reduce the overhead mostly, to describe the migration algorithm with edge-coloring theory [2,7]. All of above researches aim to achieve workload balance under a homogeneous storage environment, and suppose that data objects are independent with each other. However, it is almost impossible that happens in real environment. [22] researches on block-level data migration, but it just takes data access frequency into consideration, which affects its accuracy of candidate migration objects.
6
Conclusion and Future Work
This paper proposes the novel bi-directional block-level data migration policy FDTM for tiered storage system, the migration with double thresholds based on feedback, which could help improve the IO performance of tiered storage system. In addition, we also give out the detail process of block-level data migration, and how to select the migrated candidates, in which the data valuation mechanism could narrow the candidates’ scope and avoid the oscillation migrations by leveraging the relevance between different data blocks. The result of experiment shows that the tiered storage system with FDTM policy could provide better IO performance, and it’s easy to be implemented in current storage systems. As ongoing and future work, we’ll focus on the study of data placement, the interaction between data placement and data migration in tiered storage system. Acknowledgments. This work was supported by a grant from the National High Technology Research and Development Program of China (863 Program) (No. 2009AA01A404).
FDTM: Block Level Data Migration Policy in Tiered Storage System
89
References 1. What is storage virtualization. Veritus white paper, http://file.doit.com.cn/upfiles/2006/1027/0_230404_f1.pdf 2. Anderson, E., Hall, J., Hartline, J.D., Hobbs, M., Karlin, A.R., et al.: An experimental study of data migration algorithms. In: Proceedings of the 5th International Workshop on Algorithm Engineering, pp. 145–158. Springer, Heidelberg (2001) 3. Bucy, J.S., Schindler, J., Schlosser, S.W., Ganger, G.R.: The disksim simulation environment version 4.0 reference manual. Technical report cmu-pdl-08-101, carnegie mellon university (2008) 4. Golubchik, L., Khanna, S., Khuller, S., Thurimella, R., Zhu, A.: Approximation algorithms for data placement on parallel disks (2000) 5. Jin, H., Xiong, M., Wu, S.: Information value evaluation model for ilm. In: Proceedings of the 2008 Ninth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and parallel/Distributed Computing, pp. 543–548. IEEE Computer Society Press, Los Alamitos (2008) 6. Johnson, T., Miller, E.L.: Performance measurements of tertiary storage devices. In: Proceedings of the 24rd International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., San Francisco (1998) 7. Khuller, S., Kim, Y.A., Wan, Y.C.: Algorithms for data migration with cloning. In: Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 27–36. ACM, San Diego (2003) 8. Li, C., Zhou, L.Z., Xing, C.X.: A cost model for data placement and access path selection problem in fc-san. Journal of Software (05) (2004) 9. Li, J.T., Prabhakar, S.: Data placement for tertiary storage. In: Proceeding of the 10th NASA Goddard Conference on Mass Storage Systems and Technologies/19th IEEE Symposium on Mass Storage Systems (MSS 2002), Adelphi, Maryland, USA, pp. 193–207 (2002) 10. Lu, C., Alvarez, G.A., Wilkes, J.: Aqueduct: Online data migration with performance guarantees. In: Proceeding of the USENIX Conference on File and Storage Technologies (FAST), Monterey, pp. 219–230 (2002) 11. Massiglia, P.: Exploiting multi-tier file storage effectively. Snia tutorial, SNIA (2009), http://www.snia.org/education/tutorials/2009/spring/file/ paulmassiglia exploiting multi-tier file storagev05.pdf 12. Myllymaki, J., Livny, M.: Disk-tape joins: Synchronizing disk and tape access. In: Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, Ottawa, Ontario, Canada, pp. 279–290. ACM Press, New York (1995) 13. Narayanan, D., Thereska, E., Donnelly, A., Elnikety, S., Rowstron, A.: Migrating server storage to ssds: analysis of tradeoffs. In: EuroSys 2009: Proceedings of the 4th ACM European Conference on Computer Systems, pp. 145–158. ACM Press, New York (2009) 14. Peterson, M.: Ilm and tiered storage. Snia tutorial (January 2006), http://www.snia.org/forums/dmf/knowledge/ dmf-ilm and tiered storage 20060221.pdf 15. Prabhakar, S., Agrawal, D., Abbadi, A.E., Singh, A.: A brief survey of tertiary storage systems and research. In: Proceedings of the 1997 ACM Symposium on Applied Computing, pp. 155–157. ACM, San Jose (1997) 16. Reiner, B., Hahn, K.: Optimized management of large-scale data sets stored onttertiary storage systems. IEEE, Distributed Systems Online 5(5) (2004)
90
X. Zhao, Z. Li, and L. Zeng
17. Seo, B., Zimmermann, R.: Efficient disk replacement and data migration algorithms for large disk subsystems. ACM Transactions on Storage 1(3), 316–345 (2005) 18. Shah, G., Voruganti, K., Shivam, P., del Mar Alvarez Rohena, M.: Ace: Classification for information lifecycle management. NASA Mass Storage Systems and Technologies (2006) 19. Shepard, L.: Sgi infinitestorage data migration facility(dmf) a new frontier in date lifecycle management. White paper, sgi, http://www.sgi.com/pdfs/3631.pdf 20. SNIA: Ilm definition and scope an ilm framework (July 2004), http://www.snia.org/forums/dmf/programs/ilmi/dmf-ilm-vision2.4.pdf 21. SUN: Best practices in data classification for information lifecycle management. Sun white paper, http://www.sun.com/storage/white-papers/ best practices data classification ilm.pdf 22. Wang, D., Shu, J.W., Xue, W., Shen, M.M.: Self-adaptive hierachical storage management in san based on block-level. Chinese High Technology Letters (02) (2007) 23. Watson, R.W.: High performance storage system scalability: Architecture, implementation and experience. In: Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies, pp. 145–159. IEEE Computer Society, Los Alamitos (2005) 24. Xu, N.: A frequency-based self-adaptive data hierarchy policy. SCIENCE and TECHNOLOGY ASSOCIATION FORUM (03) (2009) 25. Zhao, X.N., Li, Z.H., Zeng, L.J.: A hierarchical storage strategy based on blocklevel data valuation. In: The Proceeding of the Fourth International Conference on Networked Computing and Advanced Information Management 2008 NCM 2008, Korea, pp. 36–41. IEEE, Korea (2008) 26. Zhao, X., Li, Z., Zhang, X., Zeng, L.: Block level data migration in tiered storage. In: Proceeding of International Conference on Computer and Network Technology (ICCNT 2010). IEEE Computer Society Press, Bangkok (2010)
Scale-Adaptable Recrawl Strategies for DHT-Based Distributed Web Crawling System Xiao Xu, Weizhe Zhang, Hongli Zhang, and Binxing Fang School of Computer Science and Technology, Harbin Institute of Technology, Harbin China [email protected], {zwz,zhl}@pact518.hit.edu.cn, [email protected]
Abstract. Large scale distributed Web crawling system using voluntarily contributed personal computing resources allows small companies to build their own search engines with very low cost. The biggest challenge for such system is how to implement the functionalities equivalent to that of the traditional search engines under a fluctuating distributed environment. One of the functionalities is incremental crawl which requires recrawl each Web site according to the update frequency of each Web site’s content. However, recrawl intervals solely calculated from change frequency of the Web sites may mismatch the system’s real-time capacity which leads to inefficient utilization of resources. Based on our previous works on a DHT-based Web crawling system, in this paper, we propose two scale-adaptable recrawl strategies aiming to find solutions to the above issue. The methods proposed are evaluated through simulations based on real Web datasets and show satisfactory results. Keywords: Distributed Web crawling, search engine, incremental crawl, DHT.
researches on the Web crawlers published in the past[7-11] , we assumed the system is handling a periodic crawl[12] . Under this assumption, the crawling system visits the whole web until its collection reaches a desirable number of pages. However, this assumption is not quite practical for the commercial search engines which, instead, adopt the idea of incremental crawl[12] . The current main information publishers: news sites, BBS sites and blog sites are generating new Web content in a very high speed. Even the content on a fixed URL changes constantly due to the changes of the DOM structure and the increase of comments and replies[13] . Therefore, in order to keep its indexed data up-to-date, the search engine has to apply different recrawl intervals to different Web site, instead of periodically crawl all the Web sites. Our goal in this paper is to design a fully distributed Web crawling system with the ability to incrementally crawl the Web so that the stored content of all the discovered Web sites can be continuously updated. The existing incremental crawl strategies[12, 14-18] are adopting recrawl intervals according to the change frequency, relevance and information longevity of the Web pages. The intervals may range from minutes to days so that the most important sites (such as news sites which has high update rate) can be scanned frequently meanwhile the rests are not in order to save the networking, storage and computing costs. When calculating the recrawl intervals, the existing works are all under the assumption that the system has a fixed scale (most of the works are using single-machine system). Things are different in the context of internet computing. As the real-time scale of the system is difficult to derive, the recrawl intervals calculated by the existing strategies are very likely to mismatch the system’s current capacity which leads to two kinds of consequences: Requirement exceeds system’s capacity (REC). The recrawl tasks are submitted so frequently that the crawlers within the system cannot handle all of them. Many tasks are queued, timed out even removed due to the crawlers’ abnormal departure. The actual update rate is below the expected recrawl rate. Requirement underestimates system’s capacity (RUC). The system has enough capacity to execute all the recrawl tasks. But the capacity is not fully used, which prevents the system from achieving a higher update rate. In this paper, we firstly make a brief summary of our former works. Unlike the traditional dedicated crawlers, our crawler is designed to be run on contributed personal computers. We have to restrict the crawler from consuming too much bandwidth and local resources. A more detailed crawler implementation is presented including basic data structures involved in the recrawl. Secondly, based on the system design, we propose two scale-adaptable recrawl strategies. The concept of the two strategies is periodically using the system’s successful update frequency (indicating the system scale) to derive a new recrawl rate. The strategies can be used as the additional steps to the existing recrawl strategies which solely focus on the characteristics of the Web pages. The two methods are evaluated through a series of experiments simulating the two situations: REC and RUC. The results show our strategies’ effectiveness. The rest of the paper is organized as follows: Section 2 presents our system design. Section 3 proposes the self-adaptive update strategies. Section 4 provides the evaluations. 
Section 5 concludes the paper.
2 System Design

First, we briefly outline our existing work presented in [6]. That paper mainly describes the system architecture of a new DWC system. The system is proposed to solve two issues: 1) scalability, which is crucial for a distributed system; 2) download time, which varies due to network locality.
Fig. 1. The architecture of the proposed DWC system. From left to right, the first figure shows the main system modules; the second figure shows the 3-tier mapping.
As shown in Fig. 1, the system consists of a Manager node and a network-location-aware overlay containing all the crawlers. The task of the manager is to maintain the Web site (or channel, a term introduced later in this paper) database and submit recrawl tasks to the overlay. The network-location-aware overlay is achieved by adopting a 3-tier mapping. As demonstrated in Fig. 1, from bottom to top, the three tiers are the physical tier, the coordinate tier and the DHT tier. On the coordinate tier, a network coordinate service (NC) maps the physical locations (measured using network latencies) of the crawlers and the Web hosts to coordinates in a multi-dimensional network coordinate space. Using the coordinates as IDs and keys, all the crawlers join a DHT overlay on the DHT tier (currently implemented on CAN), while the Web hosts are inserted into the overlay. By combining the network coordinate space and the DHT, we achieve a network-location-aware task scheduling method. Because both the DHT tier and the coordinate tier are self-organized, the manager does not need to choose which crawler a task should be assigned to, nor to monitor the state of the crawling tasks and the behavior of the crawlers. The Web pages downloaded are not stored on the crawler side; they are sent to a distributed storage system (the storage module) after a series of pre-processing steps. Because the detailed design of the manager and the crawler, which is crucial for this paper, is not presented in [6], we describe it in the following sections.
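For illustration only, the network-location-aware assignment can be pictured as matching a Web host's coordinate to the closest crawler's coordinate. The sketch below is a simplification under stated assumptions: the real system routes through the CAN overlay built on these coordinates instead of scanning all crawlers, and the names and the 2-D coordinates are ours, not part of the actual implementation.

import math

def nearest_crawler(host_coord, crawler_coords):
    # Return the ID of the crawler whose network coordinate is closest
    # to the Web host's coordinate (Euclidean distance in the NC space).
    best_id, best_dist = None, float("inf")
    for crawler_id, coord in crawler_coords.items():
        dist = math.dist(host_coord, coord)
        if dist < best_dist:
            best_id, best_dist = crawler_id, dist
    return best_id

# Example: 2-D coordinates produced by a Vivaldi-like coordinate service
crawlers = {"c1": (0.0, 1.0), "c2": (3.5, 2.0), "c3": (1.2, 0.4)}
print(nearest_crawler((1.0, 0.5), crawlers))   # -> "c3"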
3) a set of sub-directories on a Web host. The manager maintains a database of channels. When a channel needs to be recrawled, the manager packs the channel into a recrawl task and submits the task to the overlay, which finally appoints a crawler to run it. On receiving a recrawl task, the crawler simply runs the task according to the task's fields. A task contains the following fields:

The list of seed URLs. The seed URLs are used as the starting point of the crawl. Mostly these URLs point to index pages. The list also contains the URLs (pointing to this channel) that were discovered during the crawl of other channels (we call these the inter-site URLs).

The list of regular expressions to be used as the URL filter. The crawler only downloads the URLs that match the regular expressions. In case there are several kinds of expressions, we use a list to include all of them.

The number of Web pages, derived from the historical record. The number is used to decide the size of the URLseen data structure on the crawler. As all the downloaded data are transferred to it, the storage module knows when each crawl task completes and feeds the latest statistics of the channels back to the manager. These feedbacks are also used later in Section 3.

The maximum depth to crawl. Defines when the crawl process stops.

The array of page content digests. The array contains the 32-bit SHA-1 digests of all known pages, which is also fed back by the storage module. The array is used to filter the pages that haven't changed since the last crawl, so that the crawler doesn't have to extract their contents and send them back to the storage module. According to our experience, the array occupies at most 400 kB.

The digests of discovered inter-site URLs. On discovering inter-site URLs, the crawler first checks whether they are in this discovered set. If so, the URLs are ignored; otherwise the URLs are transferred back to the storage module with the downloaded contents and finally fed back to the manager.

2.2 Crawler Design

Different from dedicated crawlers, our crawler is designed to run on machines contributed by ordinary internet users who may want to do other things besides crawling. As a result, small data structures and a restricted downloading scheme have to be applied, and the whole program runs at a low system priority. As shown in Fig. 2, a crawler consists of three kinds of modules: the task queue, the task manager and the task threads.

Task queue. Crawl tasks submitted to the crawler are first put into the task queue by the task manager.

Task thread. Each task thread is a complete crawler thread responsible for downloading a channel. As the target is only one channel, the size of each data structure involved is quite small compared with dedicated crawlers. A task thread contains five main submodules: URLtodo, URLseen, Pageseen, Robotsfilter and the downloader.
1) Downloader. The downloader is responsible for downloading the Web pages. In order to limit the bandwidth usage of each task thread, the downloader opens one persistent TCP connection to the target Web host and does not use pipelining. At the first stage of downloading, the downloader fetches the Web site's robots.txt and builds a robots filter. After downloading each Web page, the downloader extracts the URLs in it. All the extracted URLs are matched against the Robotsfilter and the URLseen. If a URL passes the two filters, it is inserted into the URLtodo. After downloading, the Web pages are filtered by Pageseen, and the downloader writes the textual content to a temporary disk file.

2) URLtodo. The URLtodo is a FIFO queue used to store the URLs to be downloaded. In order to keep the queue from becoming too long, we restrict its length to 1000. URLs exceeding the limit are temporarily written into disk files.

3) URLseen. The URLseen is an array of integers storing the SHA-1 hashes of the URLs already discovered in the current crawl process.

4) Pageseen. The Pageseen is the array of page content digests mentioned in Section 2.1. We use sequential search to find whether a Web page is in the Pageseen.

5) Robotsfilter. The Robotsfilter is a list of forbidden paths on the Web host.

When the downloading process reaches the maximum depth to crawl, the downloader stops and turns into an uploader. The uploader uploads the extracted Web contents (stored in disk files by the downloader) and the statistics of the channel to the storage module. After that, the task thread terminates itself and releases the space of all its data structures.
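As an illustration of how these submodules cooperate, the following is a minimal sketch of a task thread's URL-filtering path: extracted URLs are checked against the robots filter and URLseen before entering URLtodo, and downloaded pages are skipped when their digest is already in Pageseen. The class and method names, and the use of Python sets in place of the paper's integer arrays, are simplifying assumptions rather than the actual implementation.

import hashlib
import re
from collections import deque

URLTODO_LIMIT = 1000          # in-memory cap; the system spills the excess to disk

def sha1(data: str) -> bytes:
    return hashlib.sha1(data.encode("utf-8")).digest()

class TaskThread:
    def __init__(self, url_filters, forbidden_paths, page_digests):
        self.url_filters = [re.compile(p) for p in url_filters]   # task's URL filter
        self.forbidden = forbidden_paths                           # Robotsfilter paths
        self.pageseen = set(page_digests)                          # digests of unchanged pages
        self.urlseen = set()                                       # hashes of discovered URLs
        self.urltodo = deque()                                     # FIFO of URLs to fetch

    def enqueue(self, url: str):
        h = sha1(url)
        if h in self.urlseen:                                 # already discovered in this crawl
            return
        if any(url.startswith(p) for p in self.forbidden):    # blocked by robots.txt
            return
        if not any(f.match(url) for f in self.url_filters):   # outside the channel
            return
        self.urlseen.add(h)
        if len(self.urltodo) < URLTODO_LIMIT:
            self.urltodo.append(url)
        # else: spill to a temporary disk file (omitted in this sketch)

    def page_changed(self, html: str) -> bool:
        return sha1(html) not in self.pageseen                # skip pages whose digest is known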
Fig. 2. The data structures on a single crawler (machine). A crawler consists of three main parts: the task queue, the task manager and the task threads. Each task thread implements a crawler responsible for downloading a certain channel.
Task manager. The task manager is responsible for allocating the task threads. It ensures that at any time only a restricted number of task threads are running, in order to limit the consumption of system resources. When a task thread terminates, the task manager fetches a new task from the task queue and allocates a new task thread to run it.
3 The Scale-Adaptable Recrawl Strategies (SARSS)

In our system, channels are classified according to their change frequency into several ranks we call priority groups (PGPs). Channels with higher change frequency are assigned to PGPs with higher rank. The average change frequency of the channels in each PGP is used as the PGP's initial recrawl time interval. Therefore, channels with higher change frequency have shorter recrawl intervals than the others. Here, we do not focus on how the change frequency is derived, as there is previous work [12, 14-18] concerning that issue. Instead, we focus on how the system scale affects the PGPs' initial recrawl time intervals.

As discussed in Section 1, recrawl intervals derived solely from the characteristics of the Web sites may mismatch the system's capacity (the evidence is presented as the simulation results in Section 4). As a result, we have to develop an algorithm to automatically adjust the recrawl intervals to the system's scale. The initial time intervals are only used as the starting point of the adjustment process. However, one cannot directly calculate the PGPs' recrawl intervals from the scale of the system, as there are no reliable real-time statistics on the system scale. Thus the adjustments have to be made from the records kept by the manager. In this section, we propose two scale-adaptable recrawl strategies to automatically adjust all the channels' recrawl intervals in order to maintain a high update rate. In the following, an update of a channel indicates a successful recrawl in which the content of the channel stored in the storage module is updated. Due to the node churn within the system, not all recrawls are successful. Therefore, the update rate does not equal the recrawl rate; it indicates the rate of successful recrawls.

3.1 Iterated Interval Reassignment (IIR)

The basic idea of the IIR strategy is to periodically reassign each PGP a new update interval derived from the PGP's latest update record. The reassignment process is invoked when 50% of the channels with the highest rank have completed more than 5 successful recrawls (updates) since the last reassignment. The process runs in an isolated thread so that it does not stop the system's crawl sequence.

First, we add three records to each channel: 1) historyInterval (hInterval): the average time gap between updates since the last reassignment; 2) historyRecrawlCount (hSubCount): the number of recrawl tasks (on that channel) submitted to the crawlers since the last reassignment; 3) historySuccessCount (hSucCount): the number of updates since the last reassignment. Then, each PGP is reassigned a new recrawl interval which is the mean of all its channels' historyIntervals.

To this point, we only get intervals from the historical records. In order to achieve a higher update rate, we add an encouragement mechanism to the above process. We define the successRate of a PGP as the quotient hSucCount / hSubCount and add a static variable lastSuccessRate to record the successRate calculated during the last reassignment. If the successRate is larger than lastSuccessRate, the system makes a positive attempt by decreasing the update interval, multiplying it by a pre-configured factor δ (< 1). The final step is to check the value of each PGP's new recrawl interval to ensure that PGPs with lower ranks never have smaller recrawl intervals than PGPs with higher ranks and, at the same time, to prevent the recrawl intervals from growing too large.
The above process is written as pseudocode below.
The pseudocode of IIR. The code implements the main function InheritedIntervalReassignment of the IIR. We assume K PGPs; in PGP[K][], the first dimension indicates the priority number 0..K-1 and the second dimension stores the list of channels under that priority number.

const EncourageRate = δ;   { pre-configured, < 1 }
static LastSuccessRate[K] = {0, 0, ..., 0};

program InheritedIntervalReassignment ( PGP[K][] )
var
  P: -1..K-1;
  SuccessRate[K], ActualExecInterval[], NewIntervals[K]: Real;
  RecrawlCount, SuccessCount, C: Integer;
begin
  P := -1;
  repeat
    P := P + 1;
    C := -1;
    RecrawlCount := 0;
    SuccessCount := 0;
    repeat
      C := C + 1;
      ActualExecInterval[C] := PGP[P][C].hInterval();
      RecrawlCount := RecrawlCount + PGP[P][C].hSubCount;
      SuccessCount := SuccessCount + PGP[P][C].hSucCount;
    until C = number of channels in PGP[P];
    NewIntervals[P] := mean(ActualExecInterval[0..C-1]);
    SuccessRate[P] := SuccessCount / RecrawlCount;
    if SuccessRate[P] > LastSuccessRate[P] then
      { encourage: shrink the interval }
      NewIntervals[P] := NewIntervals[P] * EncourageRate;
    end if
    LastSuccessRate[P] := SuccessRate[P];
  until P = K - 1;
  call checkPriorities(NewIntervals[]);
  set NewIntervals[] as the manager's recrawl intervals;
end.
The pseudocode of checkPriorities. The code implements the function checkPriorities, which is called by InheritedIntervalReassignment. The two constants are pre-configured multiples bounding how far adjacent ranks' intervals may diverge.

const SmallestMultiple = ...;   { pre-configured, >= 1 }
const BiggestMultiple = ...;    { pre-configured, > SmallestMultiple }

program CheckPriorities ( NewIntervals[K] )
var
  upperLimit: Real;
  P: Integer;
begin
  P := 0;
  repeat
    P := P + 1;
    if NewIntervals[P] < NewIntervals[P-1] then
      NewIntervals[P] := NewIntervals[P-1] * SmallestMultiple;
    end if
    upperLimit := NewIntervals[P-1] * BiggestMultiple;
    if NewIntervals[P] > upperLimit then
      NewIntervals[P] := upperLimit;
    end if
  until P = K - 1;
end.
3.2 Iterated Capacity Reassignment (ICR)

The basic idea of the ICR strategy is to periodically evaluate the system's capacity and distribute that capacity to each PGP. In this method, the recrawl intervals of all PGPs share a common factor $\Delta T_0^{PGP}$, the recrawl interval of the PGP with the highest rank (rank 0), and $\omega_i$, the multiple between the initial recrawl intervals of the $i$th and the $(i+1)$th PGP. The recrawl interval of the PGP with rank $k$ then equals

$\Delta T_k^{PGP} = \prod_{j=0}^{k-1} \omega_j \times \Delta T_0^{PGP}$.

$\Delta T_0^{PGP}$ is set to $\Delta t$ at the system's initial stage. We define $\Delta T_0^{PGP} = \theta \times \Delta t$, so in each ICR process our goal is to derive a new $\theta$. In the following, we present the derivation of how $\theta$ is calculated during each reassignment process. All the records used in Section 3.1 are inherited.

We assume that there are $N$ channels in the manager's database and that the channels are classified into $S$ ranks (PGPs); each channel $l$ has an update interval $\Delta T_l$, an update rate $\lambda_l = 1/\Delta T_l$, and an average download time $t_l$. Here we add a record to each channel to store its download times during the updates. According to [6], our scheduling algorithm guarantees that a channel's recrawl task is always assigned to a crawler with low latency, so unless some exceptional event such as congestion occurs, there is no significant difference between the download times. We also assume that during $\Delta T_j$ there are on average $M$ crawlers in the system, and that the number of task threads on the $k$th crawler equals $CL_k$. If we want to update each channel $j$ on schedule, the best choice is to complete all the recrawl tasks submitted during $\Delta T_j$. Under the assumption that the system's DHT-based scheduling can sufficiently balance the load of all crawlers, we have the following formula:

$\sum_{l=1}^{N} (\lambda_l \times t_l) \times \Delta T_j \times \frac{1}{\sum_{k=1}^{M} CL_k} \le \Delta T_j$   (1)

We replace "$\le$" with "$=$"; then (1) can be transformed into:

$\sum_{l=1}^{N} (\lambda_l \times t_l) = \sum_{k=1}^{M} CL_k$   (2)

The right side of (2) is not an accurate value, as the set of crawlers is subject to constant change, but it can be treated as an approximation of the latest system capacity. To implement an iterative process, we replace the right side of (2) with a history-related value derived from the record of each channel's previous L updates:

$\frac{1}{\theta \times \Delta t} \sum_{k=0}^{S-1} \Big( \frac{1}{\prod_{j=0}^{k-1} \omega_j} \sum_{l \in PGP(k)} t_l \Big) = history\Big(\sum_{l=1}^{N} (\lambda_l \times t_l)\Big), \qquad \prod_{j=0}^{-1} \omega_j = 1$   (3)

$\theta = \frac{\sum_{k=0}^{S-1} \Big( \frac{1}{\prod_{j=0}^{k-1} \omega_j} \sum_{l \in PGP(k)} t_l \Big)}{\Delta t \times history\Big(\sum_{l=1}^{N} (\lambda_l \times t_l)\Big)}, \qquad \prod_{j=0}^{-1} \omega_j = 1$   (4)

Then $\theta$ can be calculated according to (4). In the implementation, we also apply the encouragement mechanism and the checkPriorities function at the end of the process.
4 Experiments
All the experiments are carried out through detailed simulations. First, we adopt P2Psim's King dataset [20] to simulate the network latencies between the crawlers and the Web hosts. Because the size of the dataset (only 1740 nodes) is relatively small, we map the nodes into a coordinate space using Vivaldi [21] and replicate a set of new nodes around each original node. The distance between an original node and each of its replicas is a random value ranging from 10% to half of the distance between the original node and its nearest neighbor. Accordingly, we generate two larger datasets: 1) 16000 Web hosts and 1400 crawlers, where the original 1740 nodes are divided into 1600 and 140 and we replicate 10 new nodes around each node; 2) 16000 Web hosts and 2800 crawlers, obtained by replicating one more node around each of the 1400 crawler nodes.
Fig. 3. Statistics on the datasets. From left to right, top to bottom, the panels show the distributions of: network latencies, bandwidth, page numbers, and the number of channels in each PGP.
Second, the bandwidth assigned to each crawler has three levels: 512 Kb/s, 1 Mb/s and 2 Mb/s. The number of crawlers at each level is decided according to the normal distribution 1.5*N(1, 0.5). Each Web host is treated as a channel. The number of Web pages on each channel is assigned according to a Web dataset we collected in 2008. The dataset is the result of 1738 crawling tests containing 15,187,511 URLs. Because the dataset only contains 1738 Web hosts, we also have to increase its size using replication. We divide the set of Web hosts (channels) into 5 PGPs. The number of channels under each rank is decided by 1.25*N(2, 1). For each channel in the ith PGP, we assume a change frequency obeying a Poisson distribution with λ = 1/θ_i, i = 0, 1, 2, 3, 4, where θ_i indicates the average time interval after which most of the Web pages on the channel have changed. For the 5 PGPs, we assume θ_0 = 60 min, θ_1 = 120 min, θ_2 = 240 min, θ_3 = 480 min, θ_4 = 960 min. All the above setups are illustrated in Fig. 3. The simulation is based on time steps. Each step represents 1 minute. The whole simulation involves 90000 steps (1500 hours, 62.5 days). The crawlers are all added to the system at the first step. The living time of each crawler obeys the normal distribution
N(1000, 200), which means that 95% of the crawlers' living times are between 608 and 1392 steps. After the living time, 1/3 of the crawlers leave the system and notify their neighbors of the leaving event; the tasks both running and waiting are migrated to the neighbors. The remaining 2/3 of the crawlers leave the system without notification, and the tasks on these crawlers are completely lost. Each dead crawler rejoins the overlay after a time gap obeying the normal distribution N(600, 200). The maximum number of task threads on each crawler is set to 5. The maximum length of the task queue is set to 5. When the task queue is full, a newly submitted task is forwarded to a less loaded crawler according to [6]. If a task's waiting time exceeds the time limit, which equals the channel's recrawl interval, it is canceled by the crawler. The download time of each Web page is calculated by adding the RTTs, the data transfer time and a wait time (fixed to 200 ms). The RTT equals 2 times the network latency in order to simulate the time cost of a GET request under an HTTP persistent connection without pipelining. The data transfer time is calculated by dividing the size of the Web page by the download speed. The speed equals the bandwidth of the crawler divided by the number of task threads. Because the 1-minute-per-step setup prevents us from directly simulating each download, the total download time of all pages on a Web host is instead pre-calculated once the host is submitted to a certain crawler.

4.1 Evaluation Criteria

In the simulation, three recrawl strategies are compared: 1) FIXED, where the recrawl interval is fixed to the change frequency of the Web site, ignoring the fluctuation of the system's capacity; 2) IIR; 3) ICR. We use the following evaluation criteria:

The total number of recrawl tasks. The number indicates the load on the manager. Under both REC and RUC, the value should be minimized in order to reduce the manager's unnecessary load.

The number of queued tasks. The number indicates the number of tasks held in the crawlers' task queues. Under both REC and RUC, it should be minimized.

The system's throughput. The value indicates the total download rate of the system. Under both REC and RUC, the value should be maximized so that all the capable crawlers are efficiently utilized.

Update quality. For each channel $i$, its change rate (the reciprocal of the channel's change interval; the FIXED strategy uses this change rate as the channel's recrawl rate) is labeled $\lambda_i^{expected}$. In a real run, due to node churn, the actual update rate $\lambda_i^{actual}$ is always lower than $\lambda_i^{expected}$. Assuming that each PGP $i$ has a weight $R_i$, the update quality calculated by Formula (5) indicates how well the recrawl strategy performs against the channels' actual change frequencies. Here we apply the weights (used in Tables 2 and 4) to PGPs 4 to 0: $2^4 = 16$, $2^3 = 8$, $2^2 = 4$, $2^1 = 2$, $2^0 = 1$. Under both REC and RUC, the update quality should be maximized.

$updateQuality = \frac{\sum_{i=1}^{N} \Big( R_i \times \frac{\lambda_i^{actual}}{\lambda_i^{expected}} \Big)}{N}$   (5)
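The quality metric in (5) is straightforward to compute from per-channel statistics; a minimal sketch follows, where the per-channel weight is taken from its PGP rank as assumed above. The channel record layout and field names are illustrative, not the simulator's actual data structures.

def update_quality(channels):
    # channels: list of dicts with keys 'rank', 'actual_rate', 'expected_rate'.
    # Weight of a channel is 2**rank, matching the weights 16, 8, 4, 2, 1 for ranks 4..0.
    total = 0.0
    for ch in channels:
        weight = 2 ** ch["rank"]
        total += weight * ch["actual_rate"] / ch["expected_rate"]
    return total / len(channels)

# unweighted variant used in the right-hand columns of Tables 2 and 4
def update_quality_unweighted(channels):
    return sum(ch["actual_rate"] / ch["expected_rate"] for ch in channels) / len(channels)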
In addition, there are other metrics such as the number of task migrations, load variations, etc. Due to the size of the paper, we don’t present them. 4.2 The Case of REC We firstly examine the performance of IIR and ICR under the case of REC. Here we choose the dataset containing 16000 channels and 1400 crawlers. Table 1. Number of recrawl tasks and average recrawl intervals (steps) under REC. The left 4 colums show the number of recrawls. The rest 4 colums show the average re-recrawl intervals. Rank 4 3 2 1 0 Total
The left side of Table 1 shows that IIR and ICR dramatically reduce the number of recrawl tasks. To explain the reason, the right side of Table 1 shows the recrawl intervals. As the system's capacity cannot match the channels' change frequencies (used as the recrawl frequencies of the FIXED strategy), both IIR and ICR increase the recrawl intervals of each rank. From the table, we observe that IIR narrows the gaps between different ranks while ICR widens them. However, these characteristics do not cause continuous narrowing or widening: during the whole simulation, the recrawl intervals do not change in linear patterns.

From Table 1, we observe that many more tasks are submitted under the FIXED strategy. However, such a high recrawl rate does not lead to high efficiency. Under the FIXED strategy approximately 70% of the tasks are waiting in the task queue. According to Fig. 4, if we assume that the average number of online crawlers is 700, then the length of each crawler's task queue under the FIXED strategy is about 17. In this situation, most queued tasks are timed out and cancelled. On the other hand, the lengths under IIR and ICR are about 2.8 and 1.4, which indicate 83% and 92% reductions in the task queues' memory cost. IIR has fewer queued tasks because IIR's recrawl intervals on ranks 0 to 2 are larger than ICR's.

The throughput in Fig. 4 is calculated by dividing the total size of the pages downloaded in 10000 steps by the total time cost (including RTTs, data transfer time and wait time). Because the task threads are not always downloading pages, the throughput does not equal the number of crawlers multiplied by the bandwidth; it only reaches about half of that product. In addition, the curves all show an obvious decline in the first 20000 steps. This is due to the decline in the number of crawlers, since we added all the crawlers to the system at the first step. Moreover, although the number of tasks is significantly reduced under IIR and ICR, the throughput of the system does not change dramatically. The throughput under ICR almost matches that under the FIXED strategy. On the other hand, as IIR's recrawl intervals on ranks 4 to 2 are larger, the throughput under IIR declines by about 14%.
Fig. 4. The system performance throughout the simulation under REC. From left to right, the first figure shows the number of queued tasks; the second figure shows the system's throughput.

Table 2. Update quality under REC. The left three columns, labeled "(w)", use the weights assumed in Section 4.1; the right three columns use 1 as each PGP's weight.

                 FIXED (w)  IIR (w)  ICR (w)  FIXED   IIR     ICR
Update Quality   1.9987     2.0266   2.0176   0.5571  0.4936  0.4331
Table 2 shows the update quality of the FIXED strategy, IIR and ICR. As the FIXED strategy represents a very aggressive approach, its update quality can be treated as the maximum the system can achieve under REC. IIR's and ICR's weighted update qualities are very close to that of FIXED, which means the update quality has been maximized. Meanwhile, their non-weighted update qualities are lower. The reason is that, as the system's capacity cannot meet the required recrawl rate, both IIR and ICR decrease the recrawl rates of the lower-ranked channels (which accordingly decreases their update rates) in order to ensure a high update rate for the higher-ranked channels. Since the non-weighted update qualities are calculated without considering the channels' importance, they inevitably become smaller. We also notice that even under IIR and ICR, which have lower recrawl rates, the average success rate of the tasks is only 30% (this also holds under the case of RUC). The phenomenon is mainly caused by crawlers' non-notified departures, in which all the tasks assigned to the crawler are completely lost. We consider this low rate a necessary cost of a fluctuating distributed environment. On the other hand, under the FIXED strategy, since many more tasks are timed out in the queue, the success rate becomes even worse and is less than 15%.
4.3 The Case of RUC

We now examine the performance of IIR and ICR under the case of RUC. Here we choose the dataset containing 16000 channels and 2800 crawlers.

Table 3. Number of recrawl tasks and average recrawl intervals (steps) under RUC. The left four columns show the numbers of recrawls; the remaining four columns show the average recrawl intervals. Rank 4 3 2 1 0 Total
Different from the case of REC, the numbers of recrawl tasks under IIR and ICR in Table 3 do not decline. Instead, they increase by 8% and 11.5% compared with the FIXED strategy. This indicates that, under the case of RUC, in order to fully utilize the system's capacity, there is no way to reduce the load on the manager. The bottleneck should be addressed either by upgrading the manager's capacity or by deploying a number of managers to distribute the load. We also found that IIR and ICR decide the recrawl intervals differently. On one hand, IIR increases the intervals of the highest-ranked (rank 4) channels and decreases those of the lower-ranked (rank 3 to 0) channels. On the other hand, ICR decreases the intervals of the higher-ranked (rank 4 to 2) channels and increases those of the lower-ranked (rank 1 to 0) channels. Through this comparison we conclude that ICR is more likely to ensure a high recrawl rate for high-ranked channels, while IIR performs better when taking all ranks into consideration.

Fig. 5 shows the increases in the number of queued tasks and in the system throughput under IIR and ICR. To our surprise, the increases in Fig. 5 (300%-400% in the left figure and 200% in the right figure) significantly exceed the 8%-11.5% increase in Table 3. On one hand, we consider the increase in queued tasks harmless to the crawlers. If we assume that the average number of online crawlers is 1400 (half of the 2800 crawlers), then the length of each crawler's task queue under FIXED, IIR and ICR is about 0.36, 2 and 3 respectively. These numbers are low enough to prevent timeouts. On the other hand, by re-adjusting the recrawl intervals of each rank, the system's download capacity is fully used. Under IIR and ICR, the system's throughput reaches 2 times the throughput achieved under the case of REC (which also reflects the growth in the number of crawlers). Under the FIXED strategy, however, the system's throughput does not change compared with REC.

Table 4 shows that IIR and ICR significantly increase the weighted update quality, by 91% and 72.6% compared with the FIXED strategy, which is satisfactory. We further notice that the non-weighted update qualities of IIR and ICR exceed 1, which means the update rate exceeds the channels' change frequency. Nevertheless, we believe the freshness of the Web contents can still be improved, since the Web pages on a channel do not all change at once (instead, their change frequency is supposed to obey the Poisson distribution [12]). Another important issue is that if a crawler visits a Web host too often, it brings additional load to the Web host. A feasible solution is to add
an upper limit to the recrawl rate (e.g., 2 times the channel's change frequency) and, if the system's capacity permits, submit more channels to crawl.
Fig. 5. The system performance throughout the simulation under RUC. From left to right, the first figure shows the number of queued tasks; the second figure shows the system's throughput.

Table 4. Update quality under RUC. The left three columns, labeled "(w)", use the weights assumed in Section 4.1; the right three columns use 1 as each PGP's weight.

                 FIXED (w)  IIR (w)  ICR (w)  FIXED   IIR     ICR
Update Quality   4.7873     9.1474   8.2611   1.0013  1.9827  1.9701
5 Conclusions

In this paper, we propose the design of a distributed incremental crawling system. The system is designed to use machines contributed by ordinary internet users as crawlers. Based on this design, we propose two new recrawl strategies that adapt the recrawl frequency to the system's capacity in order to make full use of the contributed machines. The strategies show their effectiveness in a series of simulations.

Acknowledgments. This paper was partially supported by the National Natural Science Foundation of China under Grant No. 60703014; the National Grand Fundamental Research 973 Program of China under Grant No. G2005CB321806; the Specialized Research Fund for the Doctoral Program of Higher Education, SRFDP No. 20070213044; the National High-Tech Research and Development Plan of China under Grant No. 2007AA01Z442; and the National High-Tech Research and Development Plan of China under Grant No. 2009AA01Z437.
References 1. Foster, I.: Internet Computing and the Emerging Grid. Nature (2000) 2. Werthimer, D., Cobb, J., Lebofsky, M., Anderson, D., Korpela, E.: SETI@HOME— Massively Distributed Computing for SETI. Comput. Sci. Eng. 3, 78–83 (2001) 3. YaCy Distributed Web Search, http://yacy.net 4. FAROO Real Time Search, http://www.faroo.com 5. Majesti-12: Distributed Web Search, http://www.majestic12.co.uk 6. Xu, X., Zhang, W.Z., Zhang, H.L., Fang, B.X., Liu, X.R.: A Forwarding-based Task Scheduling Algorithm for Distributed Web Crawling over DHTs. In: The 15th International Conference on Parallel and Distributed Systems (ICPADS 2009), pp. 854–859. IEEE Computer Society, Shenzhen (2009) 7. Heydon, A., Najork, M.: Mercator: A Scalable, Extensible Web Crawler. World Wide Web 2, 219–229 (1999) 8. Boldi, P., Codenotti, B., Santini, M., Vigna, S.: Ubicrawler: A Scalable Fully Distributed Web Crawler. Software—Practice & Experience 3(8), 711–726 (2004) 9. Loo, B.T., Cooper, O., Krishnamurthy, S.: Distributed Web Crawling over DHTs. Technical report, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley (February 1, 2004) 10. Singh, A., Srivatsa, M., Liu, L., Miller, T.: Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web. In: The SIGIR Workshop on Distributed Information Retrieval, pp. 126–142 (2004) 11. Lee, H.T., Leonard, D., Wang, X., Loguinov, D.: IRLbot: Scaling to 6 Billion Pages and Beyond. In: The 17th International Conference on World Wide Web, pp. 427–436 (2008) 12. Cho, J., Garcia-Molina, H.: The Evolution of the Web and Implications for an Incremental Crawler. In: The 26th International Conference on Very Large Data Bases (VLDB 2000), San Francisco, pp. 200–209 (2000) 13. Adar, E., Teevan, J., Dumais, S.T., Elsas, J.L.: The Web Changes Everything: Understanding the Dynamics of Web Content. In: The 2nd ACM International Conference on Web Search and Data Mining (WSDM 2009), pp. 282–291. ACM, Barcelona (2009) 14. Edwards, J., McCurley, K., Tomlin, J.: An Adaptive Model for Optimizing Performance of an Incremental Web Crawler. In: The 10th International Conference on World Wide Web, pp. 106–113. ACM, Hong Kong (2001) 15. Wolf, J.L., Squillante, M.S., Yu, P.S., Sethuraman, J., Ozsen, L.: Optimal Crawling Strategies for Web Search Engines. In: The 11th International Conference on World Wide Web, pp. 136–147. ACM, Honolulu (2002) 16. Cho, J., Molina, H.G.: Effective Page Refresh Policies for Web Crawlers. ACM Trans. Database Syst. 28, 390–426 (2003) 17. Pandey, S., Olston, C.: User-centric Web Crawling. In: The 10th International Conference on World Wide Web, pp. 401–411. ACM, Chiba (2005) 18. Olston, C., Pandey, S.: Recrawl Scheduling based on Information Longevity. In: The 17th International Conference on World Wide Web, pp. 437–446. ACM, Beijing (2008) 19. Brin, S., Page, L.: The Anatomy of A Large-scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems 30, 107–117 (1998) 20. P2PSim-Kingdata, http://pdos.csail.mit.edu/p2psim/kingdata/ 21. Dabek, F., Cox, R., Kaashoek, F., Morris, R.: Vivaldi: A Decentralized Network Coordinate System. In: SIGCOMM 2004, pp. 15–26. ACM, Portland (2004)
Power Efficient Scheduling for Hard Real-Time Systems on a Multiprocessor Platform Peter J. Nistler and Jean-Luc Gaudiot University of California, Irvine Irvine, CA, USA [email protected], [email protected]
Abstract. An online, real-time scheduler is proposed to minimize the power consumption of a task set during execution on a multiprocessor platform. The scheduler is capable of handling the full spectrum of task types (periodic, sporadic, and aperiodic) as well as supporting mutually exclusive, shared resources. The solution presented is a user-adjustable scheduler that ranges from producing an optimal schedule requiring the minimum power during the worst case execution scenario, to producing a suboptimal schedule that aggressively minimizes power during the typical execution scenario. Keywords: Real-Time, Scheduler, Power Management, Multiprocessor.
1 Introduction

Multi-core systems are designed to bring processor power consumption and heat density down to a manageable level without sacrificing performance. This claim is based on the dynamic power characterization of a CMOS device,

$P = C \times V^2 \times f$   (1)

where P is power, C is the device capacitance, V is the voltage, and f is the frequency. At the same time, we are entering a world of ubiquitous computing: there are microprocessors in watches, phones, televisions, simple kitchen appliances, cars, and power grids. Many of these systems need to operate within certain time bounds to ensure proper function. Embedded systems are often designed within a prescribed power envelope. This power constraint might be defined by thermal boundaries or by energy constraints, such as in a battery or solar cell-operated environment. Performance and power requirements must be carefully weighed against each other to ensure that the system functions as desired. With the current trend towards multiprocessors and the need to minimize power, there is a demand for a hard real-time multiprocessor scheduler that optimizes for minimal power usage.
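To make the effect of combined voltage and frequency scaling concrete, here is a small numeric illustration (an added example, assuming voltage scales linearly with frequency as in the power model above): splitting a workload across two cores at half the frequency and half the voltage preserves total throughput while cutting dynamic power to a quarter.

def dynamic_power(capacitance, voltage, frequency):
    # Equation (1): P = C * V^2 * f
    return capacitance * voltage ** 2 * frequency

C, V, f = 1.0, 1.0, 1.0                              # normalized units
single_core = dynamic_power(C, V, f)
dual_core = 2 * dynamic_power(C, V / 2, f / 2)       # same total throughput: 2 * (f/2)
print(single_core, dual_core)                        # -> 1.0 0.25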
require the maximum possible execution time. Under this assumption, the scheduler scales back the operating frequency of the processor so that the task completes exactly at its deadline. This is an optimal solution for scheduling a given task set on a multiprocessor while providing hard real-time guarantees. The second type of scheduler (e.g., Gruian [2] and Malani et al. [3]) assumes that the task will execute in an average time and sets out to minimize power accordingly. The scheduler sets the operating frequency such that the task will, on average, finish execution by a set intermediate time. If the task has not finished executing by that time, the scheduler increases the operating frequency so that the task can still finish by its deadline. Fig. 1 depicts the differences between the speed scaling schemes.
Fig. 1. Scheduling Strategies
Finally, Chen and Hsueh [4] introduced a novel framework for executing periodic real-time tasks on uniform multiprocessors. They developed an optimal, simple, clear, and easily visualized scheduler. Their framework is a core component of this work. The Precaution Cut Greedy (PCG) scheduling algorithm is fluid in that each task executes at a constant rate during its executable period. It is also P-fair in that the work allocation and the work executed are within one quantum of each other at all times. The basic quantum employed by the PCG algorithm is a novel construct called a Time and Local Execution Requirement (T-Ler) plane. The T-Ler plane can be visualized as a right isosceles triangle where the y-axis is the system execution requirement (work) and the x-axis is time. The plane is created by overlaying each task's fluid schedule inside the triangle and aligning the time boundaries with scheduling events. By assuming the period of every task is equal to its deadline, the only scheduling events in Chen and Hsueh's model are the task periods. The entire system schedule is broken into a series of T-Ler planes. Inside each plane, the tasks can be executed in any fashion, provided that all the required work is completed by the end of the plane. The PCG algorithm provides an optimal solution for scheduling on uniform multiprocessors through the application of the fluid scheduling model.
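As a small illustration of the fluid model behind the T-Ler plane (an added sketch, not Chen and Hsueh's algorithm): each task's work quota inside a plane is simply its constant fluid rate times the plane length, and any per-plane schedule that completes these quotas by the plane boundary is acceptable.

def plane_work_quotas(task_rates, plane_length):
    # task_rates: dict task_id -> fluid execution rate (fraction of a unit-speed processor).
    # Returns the amount of work each task must complete inside a plane of the given length.
    return {tid: rate * plane_length for tid, rate in task_rates.items()}

# Example: three tasks with fluid rates 0.5, 0.3 and 0.2 in a plane of 10 time units
quotas = plane_work_quotas({"t1": 0.5, "t2": 0.3, "t3": 0.2}, plane_length=10.0)
# -> {'t1': 5.0, 't2': 3.0, 't3': 2.0}; the total work of 10.0 fits one unit-speed processor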
3 Problem Definition The existing power efficient real-time schedulers are not well suited for use in their current form. If a scheduler can only handle task sets where all the tasks share one common deadline; if it can only handle periodic tasks; if it cannot deal with even the
simplest of locks or other shared resources; then the scheduler has limited applicability to actual implementations. The focus of this work is to produce that well-rounded scheduler, a scheduler designed with an actual system in mind. The first part of this work establishes the hardware and power model used in the system. The second part describes the basic scheduler and supporting framework that minimizes the power consumption on a multiprocessor. This approach is strictly for hard real-time systems. The system will contain support for any mix of periodic, sporadic, and aperiodic tasks. The support for these non-periodic tasks requires that the scheduler and framework are online algorithms. The scheduler shall also maintain a strict ordering of accesses to shared resources such as memory mapped IO structures and locks. The result is an online scheduler and supporting framework that minimizes the power consumption of a multiprocessor while scheduling hard real-time tasks. The third part of this work explains how to use the scheduler and supporting framework to support the two different approaches to scheduling. The first approach describes a high-reliability system. The system is simple, clean, and guarantees an optimal solution for the worst case execution scenario. It is a combination and further extension of the work done by [1] and [4]. The second approach describes a more commercial solution, optimized for the typical execution scenario. It combines the high-reliability approach with the idea of stochastic scheduling. The work is broken up into three different sections. Section 4 discusses what the power optimal configuration of the hardware looks like. Section 5 extends Chen and Hsueh’s scheduling framework for supporting periodic, sporadic, and aperiodic tasks while supporting shared resources. Section 6 describes how to tailor the scheduler so that it can proactively reduce the system power.
4 Power Optimal Hardware Configuration This section addresses the problem of selecting processor operating frequencies so that the system can support the requirements of a given task set. The power optimal hardware configuration and a simple algorithm for determining it are provided. 4.1 The Processor / Task Relationship A relationship needs to be established between the processor capabilities and the task execution requirements. First a multiprocessor model is selected to establish a foundation for the work. Based on this model’s relationship between the software and hardware some bounds can be established to find the minimum operating frequencies required to support the task set. There are two different high-level multiprocessor models that need to be explained before the real work can begin. The first model is identical parallel machines. Each of the processors in the system is identical to all the other processors and operates at the same frequency. Thus all the processors in the system perform identically, allowing tasks to freely migrate without affecting the task’s execution. This model is the easiest to support at the operating system and scheduler level.
The second model is uniform parallel machines. Similar to identical parallel machines, tasks can freely migrate, as an identical instruction set is used on all the processors. The difference is that the processing capabilities of the processors are only linearly related to each other. The linear relation is $s = w/t$, where $w$ is the amount of work executed and $t$ is the time elapsed. Each processor in this system is allowed to take on any positive value for $s$. Regardless of the scheduler's goals, it must be aware of how the processors compare to each other in order to make a good decision.

The multiprocessor model used in this research is a special case of the uniform parallel machine model. The processors are physically implemented as identical parallel machines; however, each of the processors can independently operate at a unique clock frequency. There are several reasons for taking this approach. The first reason is that the software designer only needs to optimize the code to execute on a single processor microarchitecture. The second reason is that this approach relates processing capability and operating frequency through a simple, linear formula: if the processing capability is normalized to the maximum operating frequency, then it follows that $s_i = f_i / f_{max}$. The final reason is that running separate clock frequencies to each core allows a fine-tuning of the power consumption.

As described earlier, the amount of work (machine instructions) that a processor can complete in time $t$ is defined by $w = s \times t$. The tasks being scheduled are described by the triplet $(c_i, d_i, p_i)$, where task $\tau_i$ has a worst case execution time $c_i$ at the maximum frequency, a relative deadline $d_i$, and a period $p_i$. Note that $c_i$ is a measure of time rather than a measure of work. It follows that the work required to execute $\tau_i$ must be $w_i = c_i \times s_{max}$, where $s_{max}$ is the processing capability of a processor at its maximum frequency. Recall that with $s_{max} = 1$ the relationship simplifies to $w_i = c_i$: the work of a task equals its worst case execution time.

A common metric for determining schedule feasibility in real-time systems is the task utilization of a processor, $u_i$. The utilization on a fixed-frequency uniprocessor is described as $u_i = c_i / p_i$. This equation needs a slight update in order for it to work on a variable-frequency system: the actual execution time must meet the equality $e_i = c_i / s$. When the utilization is combined with the equation for $e_i$, it follows that a processor must have a processing capability $s \ge u_i$ in order for the task to finish by its deadline. Further, it becomes apparent that the minimum operating frequency of the processor executing $\tau_i$ is simply $u_i \times f_{max}$. This is a lower bound on the selectable frequencies that can be used while executing a given task. This utilization model can be extended to the system level, where the total utilization of the system is characterized by

$U_{sys} = \frac{1}{m} \sum_{i=1}^{n} u_i$   (2)

where $m$ is the number of processors and $n$ is the number of tasks in the system. As a quick check of task set feasibility, note that if at any point in time either $U_{sys} > 1$ or some $u_i > 1$, then the task set can never be successfully scheduled on the system.
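A sketch of this quick check, under the normalization assumed above (utilizations expressed relative to the maximum frequency); the function and variable names are added for illustration.

def quick_feasibility_check(utilizations, num_processors):
    # Necessary conditions from equation (2): the task set is infeasible if any single
    # task needs more than one full-speed processor, or if the average per-processor
    # utilization exceeds 1. Passing this check does not by itself guarantee feasibility.
    if any(u > 1.0 for u in utilizations):
        return False
    u_sys = sum(utilizations) / num_processors
    return u_sys <= 1.0

print(quick_feasibility_check([0.6, 0.5, 0.4], num_processors=2))   # True
print(quick_feasibility_check([1.2, 0.1], num_processors=4))        # False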
4.2 The Ideal Processor Frequencies

This section investigates the optimal processor speeds for a given workload, without regard for the real-time constraints. The only goal of this section is to determine the optimal operating frequency of each processor such that a given amount of work can be accomplished in a set amount of time. As mentioned earlier, the power of a system is quadratic in the operating voltage of the device and linear in the frequency $f$. Furthermore, the relationship between the required operating voltage and the frequency on dynamic voltage scaling hardware is assumed to be linear ($V \propto f$). Since $V$ is linear in $f$, the result is that $P \propto f^3$ when frequency and voltage are scaled together. The power model can be easily extended to support multiprocessor systems. By summing the power of each individual processor, the system's power is modeled by

$P_{sys} = \sum_{i=1}^{m} P_i \propto \sum_{i=1}^{m} f_i^3$   (3)

This is not simply the power of one processor multiplied by $m$, because each of the processors can take on a unique frequency. From the earlier equality $s_i = f_i / f_{max}$ and equation (2), it can be seen that a system must maintain a total processing capability of at least the total task utilization to ensure that all the tasks in a given span will complete by their deadlines. If the total capability exceeds the total utilization, then the system is capable of processing more work than the task set demands. On the other hand, if it falls short, then the system cannot process all of the task set's demands, which means that the task set is infeasible on the system.
is minimized
Case 1: 1 This case describes a uniprocessor system. All the work must be executed on the one processor. Case 2: 2 In this case the total processing capacity er characterization of
, which leads to the system pow-
. 1 holds This system functions correctly as long as the original assumption 0 true. The result is that the minimum power is achieved when the work is evenly distributed across the processors. 2 Case 3: 2 The optimal solution for this case can be established with a proof by contradiction. Assume that there is a system where the optimal power is achieved by unevenly distributing the work among all the processors. This assumption can be expressed by the formula
Power Efficient Scheduling for Hard Real-Time Systems
111
1 . Where 1 1 The first equation states that there exists a state where one processor deviates from the average execution speed / by while other processors adjust their frequency to ensure that . The deviation must be bounded to ensure that none of the processors are set to a negative frequency. Solving the first equation for gives the result 3 . 1 When this equation is evaluated across the bounds of it follows that 3 / . This result violates the bound on ; one of the processors must run at a negative frequency in order for the first equation to hold true. This is nonsense. Therefore by contradiction, the minimum system power is achieved at ,
1
.
(4)
The result is that the hardware consumes the least power when the processors all operate at the average required frequency; the optimal solution from the hardware's perspective is to set every processor's frequency to $f_{avg}$. Therefore, the power optimal scheduler must try to produce a schedule in which all the tasks can be feasibly scheduled at this average operating frequency.

4.3 Power Optimal Software Configuration

This section addresses the problem of selecting processor operating frequencies such that the system consumes the minimum power while maintaining task set feasibility. The system must find a solution that is as close to the hardware-optimal solution as possible while producing a feasible configuration. This section establishes the requirements for the scheduling algorithm. The job of selecting optimal processor frequencies can be reduced to arranging all the task utilizations into bins such that the total value of each bin is as close to the average as possible. It is important to note that tasks in the system are assumed to be non-divisible. One task cannot be split into parallel threads and simultaneously executed on multiple processors to shorten its execution time. A task is only allowed to execute on one processor at a time; however, it is allowed to migrate between processors without penalty. The feasibility constraint for uniform multiprocessors has already been described by Funk et al. [5]; their result follows.

Theorem 2. Feasibility Condition on Uniform Multiprocessors by Funk et al. [5]. Consider a set $\tau = \{\tau_1, \tau_2, \dots, \tau_n\}$ of tasks indexed by non-increasing utilization (i.e., $u_i \ge u_{i+1}$ for all $i$). Let $U_k = \sum_{i=1}^{k} u_i$. Let $\pi$ denote a system of $m$ uniform processors with processing capabilities $s_1 \ge s_2 \ge \dots \ge s_m$, and let $S_k = \sum_{i=1}^{k} s_i$. Task set $\tau$ can be scheduled to meet all deadlines on uniform multiprocessor platform $\pi$ if and only if the following constraints hold: $U_n \le S_m$ and $U_k \le S_k$ for all $k = 1, \dots, m-1$.
The speed scaling algorithm needs to select processor frequencies as close to the average as possible while meeting the constraints of Theorem 2. The first and easiest constraint is to ensure that no more processors are provided to the scheduler than are needed; this is accomplished by idling or powering down any excess processors. Next, the algorithm must ensure that the list of tasks is sorted by utilization. If the utilization of the most demanding task in the set is less than or equal to the average, then it is possible to schedule all the tasks with the active processors operating at the average frequency. The solution for this case meets the constraints of Theorem 2 and is optimal from a power perspective.

When the largest utilization is greater than the average, a slightly different approach needs to be taken. In order to ensure that Theorem 2 is upheld, the fastest processor's capability is set equal to the largest task utilization. In this situation one of the processors in the system must operate at the corresponding frequency in order to meet the constraints; further, that processor is then operating at the minimum frequency possible for meeting the deadline of the most demanding task. The deviation from the frequency necessary for maintaining task set feasibility has thus been reduced to zero, so this processor is running at the optimal frequency for the current condition. Once the frequency of this processor has been set, it can be ignored for the remainder of the speed scaling algorithm. Likewise, since the work required by the most demanding task has been accounted for, this task is ignored as well. The speed scaling algorithm is then recursively called on the reduced set of processors and tasks. By induction, this speed scaling algorithm produces a minimum power system configuration that achieves task set feasibility. This speed scaling algorithm will be implemented as a piece of the LTF-M algorithm developed by Chen et al. in [1].
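The recursive frequency assignment just described can be sketched as follows (an added rendering of the idea, not the LTF-M code from [1]): utilizations and speeds are normalized to the maximum frequency, and each round either assigns the average to all remaining processors or pins the fastest remaining processor to the largest remaining utilization and recurses.

def assign_speeds(utilizations, num_processors):
    # Return a list of normalized processor speeds (1.0 = maximum frequency),
    # following the recursive speed-scaling step described above.
    utils = sorted(utilizations, reverse=True)
    speeds = []
    while num_processors > 0:
        avg = sum(utils) / num_processors
        if not utils or utils[0] <= avg:
            # every remaining task fits at the average speed
            speeds.extend([avg] * num_processors)
            break
        # pin the fastest remaining processor to the heaviest remaining task
        speeds.append(utils.pop(0))
        num_processors -= 1
    return speeds

print(assign_speeds([0.9, 0.3, 0.2, 0.2], num_processors=3))   # -> approximately [0.9, 0.35, 0.35]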
5 Building the Scheduler The scheduler being proposed is composed of three components: the framework, the speed scaling algorithm, and a T-Ler plane scheduling algorithm. The framework is the mechanism which breaks the timeline into a series of T-Ler planes. It defines the boundaries of each plane and keeps track of which task needs to be executed in the plane. The framework provides the current task set to the speed scaling and T-Ler plane scheduling algorithms. The speed scaling algorithm, in turn, sets the operating frequency of each processor such that the system consumes the least power while maintaining a feasible hardware configuration. The task set, the length of the current plane, and the processor configuration are finally passed to the plane scheduling algorithm. The plane scheduling algorithm then performs its role of executing the tasks to meet the deadline on the current hardware configuration.
Fig. 2. Breakdown of the Scheduler
The framework based on Chen and Hsueh is overly constrained. It can only execute a subset of periodic tasks, it does not dynamically adjust to reduce power when a task completes early, it assumes that the scheduler runs in zero time, and it cannot handle shared resources; these are major hindrances to an actual implementation. This section describes the changes needed to address these shortcomings.

5.1 Supporting More Task Types

The first shortcoming of the framework proposed by Chen and Hsueh in [4] is that it only supports a highly restricted set of task types. The authors assumed that the period of every task equals its deadline. This assumption is mutually exclusive with the ability to support sporadic and aperiodic tasks. The solution is to extend the framework so that it supports task sets where the deadline may differ from the period. The plane scheduler requires that every task's period and deadline line up with a T-Ler plane boundary, and the framework will set the plane boundaries accordingly. The downside of allowing the deadline and the period to take on different values is that there may be more T-Ler planes within the hyperperiod. Naturally, more T-Ler planes directly correspond to more invocations of the scheduler.

A sporadic task can be viewed as a specific type of periodic task. It has its own worst case execution time and relative deadline, an arrival time, and an infinite period. However, the sporadic task cannot be injected directly into the current task set, or deadline violations might ensue. This is because the current T-Ler plane has already been configured for the existing task set. Either the processor frequencies would require adjustment within the plane, or the sporadic task must wait to begin execution until the subsequent plane. The framework will wait for the subsequent plane. There are two reasons driving this decision. The first reason is the high execution cost required to adjust the schedule mid-plane. The second reason is to minimize the number of changes to the processor frequencies. This decision to wait can result in a large number of rejected sporadic tasks which might have been executable if only they had begun executing earlier. This rejection rate can be reduced by having the framework insert extra plane boundaries into the system, thereby reducing the average length of a plane. The trade-off is an increased number of scheduler invocations.

The sporadic task acceptance test for this scheduler is based on Theorem 2. If the feasibility condition holds true while the sporadic task's utilization is included in the task set for all T-Ler planes in which the sporadic task executes (from the first boundary encountered after the sporadic task arrived, through the task's deadline), then the task can be accepted. With this, the scheduler is capable of handling sporadic tasks while providing real-time guarantees. In a similar manner, aperiodic tasks can be executed whenever the system has free resources. This means that, provided the constraints of Theorem 2 remain true, the scheduler can assign the aperiodic task the remaining spare utilization and schedule the task within the current T-Ler plane.
5.2 Exploiting Hidden Speed-Ups

The second shortcoming in Chen and Hsueh's work is that the work allocation and execution are assumed to be P-fair. In order to enforce this P-fair model, each task would need to be heavily instrumented to monitor its progress. The solution is to use a weaker model which states that at all times the actual work executed is greater than or equal to the actual work allocation. This model gives the flexibility to run a task faster than its worst case execution time, which is important for implementation on a real system, as there are many hidden factors which can improve the execution time. Several common examples are branch predictors, caches, and break statements nested inside a loop. The weaker model allows the system to exploit these hidden speed-ups and does not require instrumentation of the code. This weaker fairness model is only used in the framework; it cannot be driven into the scheduling algorithm without risking a significant increase in complexity.

When a task finishes its execution it informs the framework that it has completed. The framework removes the task from the task set until its next period, if applicable. The system utilization for the new plane is then reduced by the utilization of the completed task. Inside a given T-Ler plane the scheduling algorithm will not be aware that a task is executing faster than the worst case. The tasks are scheduled onto the processors according to their utilizations, which at this point are dictated by the worst case execution times. The result is that the schedule for the plane can be determined at the beginning of the plane. This pre-determined schedule will be followed for the duration of the plane without any further invocations of the scheduling algorithm.

The framework can further minimize the system power by inserting extra plane boundaries into the hyperperiod. The extra plane boundaries reduce the delay from when a task finishes to when the hardware configuration is adjusted, at the next invocation of the speed scaling algorithm. As the time between boundaries approaches zero, tasks are removed as soon as they finish. This means that the scheduler never requests more work from the system than is absolutely necessary. The price for this power savings is the overhead of calling the scheduler more frequently. This price is not as dramatic as it first seems if the framework inserts boundaries that break a larger T-Ler plane into smaller T-Ler planes of identical length. In the case where no task finishes early, the schedule produced for each plane will be identical: the task set and the plane length remain unchanged, which guarantees the same result from the speed scaling and scheduling algorithms. The framework should recognize this and reuse the same schedule, saving significant execution time.
5.3 A Lightweight Scheduling Algorithm
Recall that the speed scaling and scheduling algorithms will be invoked prior to the start of every T-Ler plane. Their large execution complexity will become unwieldy as the number of planes increases. The LTF-M algorithm developed by Chen et al. in [1] is a lightweight algorithm which optimally sets the processor frequencies and schedules the tasks, provided that all the tasks share a common start time and a common deadline. These requirements are met at the level of the T-Ler plane. The LTF-M algorithm has low computational complexity yet is still optimal, both from the feasibility perspective and from a power perspective.
5.4 Shared Resources
The next improvement for the scheduler is sharing resources. Currently the scheduler requires that every task execute independently with no ordering between tasks. However, a resource often allows only one task to access it at a time. This section explains how the scheduler framework supports mutually exclusive, shared resources and introduces time-based locks to enforce the ordering between tasks. Support for shared resources is handled through mutex locks. These locks require a minor modification to the scheduler framework and do not violate any real-time guarantees. The change is most clearly explained with the help of Fig. 3.
[Figure: fluid schedules of tasks T1 and T2 plotted against time, with shaded critical sections bounded by the instants t_a, t_b, t_c, t_d, and t_f.]
Fig. 3. Shared Resources in the Fluid Schedule
The system is executing two tasks, T1 and T2, where the shaded regions represent the time at which the shared resource is held by each task. The location and execution time of these critical sections must be provided to the framework. The framework needs to know the last possible moment at which each critical section may begin (t_a for T1 and t_c for T2) and the last possible moment at which it may be exited (t_b for T1 and t_d for T2). There will be no conflicts provided that t_b ≤ t_c or t_d ≤ t_a while the tasks execute according to their fluid schedules. However, within a T-Ler plane each task does not necessarily execute according to its fluid schedule. For example, T1 may not begin execution in a given plane until well after time 0. This delayed start may push T1's critical-section finish time past t_c. In that case T2 would attempt to enter its critical section at t_c while the resource is still held, violating the deadline guarantee. The simple solution is to force the tasks to acquire and release the resource at fixed points in time: the lock must be requested and released at the granularity of a whole T-Ler plane. This is accomplished by having the framework insert plane boundaries at the requesting edge of a critical section. Provided that the critical sections do not overlap in the original fluid schedule, setting the requesting edge as a boundary ensures that the deadline guarantee is preserved.
As the framework prepares to pass the upcoming plane's task set and deadline to the LTF-M algorithm it performs a quick check. If there are multiple calls to a common lock, then the framework inserts a boundary at the beginning of the second request. Effectively the current plane is shortened to run from the current time to the newly inserted boundary. (It is at this point that the framework can introduce extra, evenly spaced boundaries which can be used to reduce power and lower the sporadic task rejection rate.) The smaller, conflict-free plane is then passed to the LTF-M algorithm. This solution ensures that at no point can multiple tasks be given access to a single, shared resource.
As mentioned in the last section, a task may execute faster than its fluid schedule, resulting in a task requesting a resource early. There are two possible cases. The first case is when one task requests the resource while the other task is holding it. The outcome is that the requesting task is forced to block its execution until the holder releases the lock. There is no problem here, because the requesting task is not required to be executing yet; the lock naturally solves the problem. The second case is when the later task executes significantly faster than the earlier one, such that it requests the lock first. If it is allowed to acquire the lock first, then the system can no longer guarantee the deadline of the task that should have held the resource first. In order to satisfy the hard real-time constraints, the system needs a method for selectively granting locks to the tasks. The tasks cannot simply employ the processor's compare-and-swap instruction, as the system would then be unable to enforce the desired order; and directing all lock requests through the operating system introduces overhead. The solution to the problem is to use time-based locks. Such a lock is constructed so that only the scheduler / operating system can set the value of the lock to a non-zero value, and the value of the lock corresponds to a task ID. In order to use the lock, a task verifies that its ID matches the lock's value. If the values match, the task proceeds with the assurance that it has sole possession of the lock. Otherwise, the task is required to continually check the value of the lock until the value matches its task ID. When the task wishes to release the lock it sets the lock value to 0. Notice that once the task releases the lock it cannot reacquire the lock on its own.
In the system described here, the framework is responsible for granting the locks. Because the framework has used the requesting edge of a resource as a T-Ler plane boundary, only one task can require a given resource within a plane; if multiple tasks required the same resource in a plane, the task set would not be schedulable in the first place. Therefore, if the tasks were originally implemented with no resource conflicts, the scheduler can set the lock value to the one task that requires the resource. This time-based lock only works because the execution time of each critical section has been well defined. The critical code sections will be strictly ordered according to the task utilization values to ensure that all deadlines are met. The tasks can directly check the value of the lock with a read instruction and release the lock with a write instruction. Therefore, if the fluid task schedule does not show any resource conflicts, the system will function correctly.
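The time-based lock can be pictured as nothing more than a shared integer that only the scheduler may set to a non-zero task ID. The Python sketch below is a single-address-space illustration of that protocol (busy-waiting included), not the kernel-level mechanism itself; the class and method names are assumptions.

import time

class TimeBasedLock:
    def __init__(self):
        self.value = 0                    # 0 means "not granted to anyone"

    def grant(self, task_id):             # only the scheduler / operating system calls this
        self.value = task_id

    def acquire(self, task_id, poll=1e-4):
        while self.value != task_id:      # spin until the lock value matches our task ID
            time.sleep(poll)

    def release(self, task_id):
        if self.value == task_id:
            self.value = 0                # after this, the task cannot reacquire on its own

# Per plane: the framework grants the lock to the single task whose critical
# section falls inside that plane, e.g. lock.grant(task_id=1) before the plane starts.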
5.5 Running the Scheduler
This section focuses on the execution requirement of the scheduler and on how to invoke the scheduler while meeting the system's requirements. The execution requirement of the scheduler is described by the complexity of the framework and of the LTF-M algorithm. With bounds on the computational complexity, the worst-case execution time of the scheduler can be determined for a given task set: it is proportional to the complexity bound, with a constant that is jointly determined by the hardware implementation and the actual code implementing the scheduler. Combining the previously derived complexity of the LTF-M algorithm with the complexity of the framework, the total scheduler complexity per plane is log-linear in the number of tasks and in the number of locks in the system, and the scheduler's worst-case execution time per plane is bounded accordingly.
This bound on the worst-case execution time of the scheduler allows the system to schedule the scheduler itself as a task. The only real-time constraint on the scheduler is that it finishes execution before the start of the T-Ler plane it is scheduling. The assumption hitherto has been that the scheduler ran between the end of the previous plane and the start of the subsequent plane. However, the power consumption can be reduced by running the scheduler in the background. The system should release the scheduler at the start of the previous plane and set its deadline to the start of the next plane. By running the scheduler in the previous plane, the system ensures that the schedule is completed before the next plane begins, while providing the largest possible deadline without overlapping the execution periods of any two scheduler invocations. In this case the scheduler appears to be a recurring sporadic task: in each plane the scheduler task arrives with an arrival time and deadline unique to that plane.
The average power consumption for running the scheduler in the previous plane is lower than for running the scheduler between the planes. This can be shown by comparing equations (3) and (4). The average power required for running the scheduler between planes accounts for the scheduler running on one processor for its worst-case execution time, while the execution of both the task set and the scheduler must still complete within the original length of the plane. To calculate the average power when running the scheduler in the previous plane, we assume that the work required to schedule the next plane is the same as the work required for the current plane; setting the work requirements to be equal creates a fair comparison, as both cases execute the same amount of work within the same overall timeframe. In the previous-plane case the scheduler is simply added to the plane's task set as one more task.
This comparison shows that in the ideal case the power consumed by the task set increases in the same fashion as when a sporadic task is added to the system.
When the scheduler's utilization is zero, the average power consumed by each approach is identical. As the scheduler's utilization moves towards its upper bound, both expressions increase monotonically; the difference between the approaches lies in the limits they increase towards. Taking the limits of both expressions as the scheduler's utilization approaches its upper bound shows that the between-planes power exceeds the previous-plane power. Since the two expressions are equal when the scheduler's utilization is zero and both increase monotonically towards these limits, it is more power efficient to run the scheduler inside the previous T-Ler plane than to run it between the planes.
6 Unique Solutions for Unique Problems
The scheduler developed up to this point is the high-reliability scheduler. It optimizes the task set execution for each task's worst case, assuming that the task always requires its worst-case execution time during the period. Consequently, it gives each task a fluid schedule whose utilization is the worst-case execution time divided by the period. The system adjusts the power reactively, only adjusting once it is certain that a task is performing faster than the worst case.
In an ideal situation each task in the system would execute with a utilization equal to its actual execution time divided by its period, corresponding to the minimum speed required to finish the task by the deadline. The problem with this scenario is that the actual execution time cannot be known in advance. The solution is to use a predicted execution time instead, which is assumed to be relatively close to the actual value. The trade-off is that an incorrect prediction may cause the system to consume more power than the high-reliability approach. This section extends the previously developed high-reliability scheduler to proactively reduce system power.
Ideally the scheduler would know the actual execution time before the task began execution. But the reality is that this value cannot be known in advance with complete certainty; no prediction is completely accurate. Therefore, the scheduler must be capable of optimizing the task execution based on an inaccurate prediction. The scheduler cannot trust the inaccurate prediction without inheriting the potential risk of missing a deadline. The optimization must be made while ensuring that it remains possible for the task to execute for its full worst-case execution time by its deadline. The task utilization does not need to be constant during execution, provided that on average it is sufficient to deliver the worst-case work by the deadline. The resulting schedule is similar to the stochastic model depicted in Fig. 1. If the task finishes execution early it will require less power than the WCET model. The scheduler must ensure that it is possible for the worst-case execution to be completed exactly at the deadline.
At this point the scheduler is provided with a predicted execution time for the task. For the scheduler to set a task utilization it must also generate a predicted deadline for completing the predicted amount of execution. Using these two values, the task can execute with a utilization equal to the predicted execution time divided by the predicted deadline, from time 0 to the predicted deadline. If the task does not complete by the predicted deadline, the scheduler sets a higher utilization for the remaining execution.
The scheduler must make a correct decision when it selects the predicted deadline. The value should be large enough to minimize the utilization between time 0 and the predicted deadline, while ensuring that the utilization required in the latter part of execution does not exceed 1: it must remain possible for a single processor to execute the remaining worst-case work by the real deadline. This yields an upper bound on the predicted deadline. The scheduler can also develop a lower bound on the predicted deadline: the lower bound provided by the high-reliability design corresponds to running at the worst-case utilization, i.e., the predicted deadline is no smaller than the predicted execution time divided by the worst-case utilization. As long as the predicted deadline meets these bounds, the real deadline can be met. Additionally, when the prediction is accurate, the system power is at least as low as in the high-reliability design.
The location at which the predicted deadline falls within these boundaries determines the reliability of the system. At the lower bound the predicted value has no impact on the system; in this case the schedule is non-aggressive and is optimized for the worst case, and the result is a schedule identical to the high-reliability design. Conversely, when the predicted deadline is at its upper bound, the accuracy of the prediction has a significant impact on the system. A correct prediction uses the least power possible while guaranteeing the deadline, whereas an incorrect prediction will require more power than the high-reliability schedule.
This approach is not without penalty. The average power can be worse with an inaccurate predictor, and worse still, the instantaneous power consumption can spike. When the prediction is too small, the utilization suddenly increases at the predicted deadline, corresponding to an increase in the required power. If all the tasks share a common deadline, every task is poorly predicted, and each task requires its complete worst-case execution time, then the total system power will dramatically increase: the utilization of each task directly before the deadline would be 1, and the power required to meet this demand is huge. More importantly, it means that there are task sets for which the system cannot guarantee feasibility when it aggressively selects the predicted deadline. When the prediction is at the upper bound the scheduler is non-optimal, and the guaranteed feasibility bound is reduced.
This worst-case feasibility scenario only happens when the predicted execution time is inaccurate and the predicted deadline is selected aggressively. Conversely, when the prediction is correct or the predicted deadline is at its lower bound, the system is optimal. Using this intuition the system designer can tailor the system to the application's needs. This commercial scheduler is not that different from the high-reliability scheduler; the only significant change is that the new design has the task execute with a variable utilization. The actual utilization value is selected as a trade-off between potential power savings and system reliability. As the system is adjusted to reduce power more aggressively, the guaranteed feasibility bound is reduced from the bound of Theorem 1.
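Under the description above, the predicted deadline has to sit between the point implied by the high-reliability (worst-case) utilization and the latest point at which a single processor can still finish the remaining worst-case work. The formulas in the Python sketch below are a reconstruction consistent with that description, not the paper's own notation.

def deadline_bounds(wcet, deadline, c_pred):
    """Bounds on the predicted deadline d_pred for one task (all times relative to release)."""
    u_wc = wcet / deadline                       # high-reliability (worst-case) utilization
    d_low = c_pred / u_wc                        # no faster than the worst-case rate: identical to the high-reliability schedule
    d_high = deadline - (wcet - c_pred)          # beyond this, the post-prediction utilization would exceed 1
    return d_low, min(d_high, deadline)

def utilization_profile(wcet, deadline, c_pred, d_pred):
    u1 = c_pred / d_pred                         # utilization used until the predicted deadline
    u2 = (wcet - c_pred) / (deadline - d_pred)   # utilization required if the prediction was too low
    return u1, u2

# Example: wcet=10, deadline=20, c_pred=4 -> d_pred may range from 8 to 14;
# picking d_pred=14 gives u1 of roughly 0.29, but u2 = 1.0 if the task runs to its full WCET.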
7 Conclusion and Future Work A power efficient, hard real-time scheduler for multiprocessor platforms is developed in this paper. The original goal to develop a system that would be useable in an actual system implementation imposes several constraints on the scheduler. It needs to support the spectrum of real-time tasks: periodic, sporadic, and aperiodic. The scheduling
mechanism also needs to be scalable while supporting mutually exclusive, shared resources. In the end all of the goals are achieved while maintaining a low complexity that is log-linear in the number of tasks and in the number of mutually exclusive, shared resources in the system. The solution is also an optimal scheduler when the lower bound of the predicted deadline is chosen; in this case it optimizes the schedule for minimum power during the worst-case execution scenario. However, the scheduler allows for more aggressive power reductions according to the needs of the system designer. The trade-off is that the feasibility bound is reduced. This scheduler is useful for an actual implementation: it takes into account that processor speeds should be changed sparingly, it minimizes context switches, and it takes its own execution time into consideration. The future work is to implement and test this scheduler on an actual hardware platform and to get feedback from industry's real-time community. Acknowledgements. This work is partly supported by the National Science Foundation under Grant No. CCF-0541403. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References 1. Chen, J., Hsu, H., Chuang, K., Yang, C., Pang, A., Kuo, T.: Multiprocessor EnergyEfficient Scheduling with Task Migration Considerations. In: 16th Euromicro Conference on Real-Time Systems, pp. 101–108. IEEE Computer Society, Washington (2004) 2. Gruian, F.: Hard Real-Time Scheduling for Low-Energy Using Stochastic Data and DVS Processors. In: 16th International Symposium on Low Power Electronics and Design, pp. 46–51. ACM, New York (2001) 3. Malani, P., Mukre, P., Qiu, Q., Wu, Q.: Adaptive Scheduling and Voltage Scaling for Multiprocessor Real-Time Applications with Non-Deterministic Workload. In: DATE 2008: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 652–657. ACM, New York (2008) 4. Chen, S., Hsueh, C.: Optimal Dynamic-Priority Real-Time Scheduling Algorithms for Uniform Multiprocessors. In: Proceedings of the 2008 Real-Time Systems Symposium, pp. 147–156. IEEE, Washington (2008) 5. Funk, S., Goossens, J., Baruah, S.: On-line Scheduling on Uniform Multiprocessors. In: Proceedings of the 2001 Real-Time Systems Symposium, pp. 183–192. University of North Carolina at Chapel Hill, North Carolina (2001)
Storage Device Performance Prediction with Selective Bagging Classification and Regression Tree Lei Zhang1 , Guiquan Liu1 , Xuechen Zhang2 , Song Jiang2 , and Enhong Chen1 1
Abstract. Storage device performance prediction is a key element of self-managed storage systems and application planning tasks, such as data assignment and configuration. Based on bagging ensemble, we proposed an algorithm named selective bagging classification and regression tree (SBCART) to model storage device performance. In addition, we consider the caching effect as a feature in workload characterization. Experiments indicate that caching effect added in feature vector can substantially improve prediction accuracy and SBCART is more precise and more stable compared to CART. Keywords: Performance prediction, Storage device modeling, CART, Ensemble learning, Bagging.
1 Introduction
Today's high-end storage systems are highly complex and configurable, and the automation of storage management is a critical research challenge. One key issue in the automation of storage management is the placement of data sets onto the various devices in the storage system, or how to map a workload of specific characteristics onto an appropriate storage device for high service quality and system utilization. To this end, the system must be able to predict the performance a device can provide in its service of a particular workload. Performance prediction for storage systems has long been studied. Among the proposed approaches, three methods are particularly useful and efficient: analytic device modeling, simulation and emulation, and black-box modeling. It is a cumbersome task to build accurate analytic models for disk drives because of their nonlinear and state-dependent behavior. Ruemmler and Wilkes [1] developed analytic disk models that take into account disk head positioning, platter rotation, and data caching and read-ahead. The model was further improved by Worthington et al. [2], resulting in a widely used disk simulator, DiskSim [3], which represents the state of the art in disk simulation. DiskSim simulates almost
all performance-relevant components of a disk, including device drivers, buses, controllers, adapters, and caches. Emulators go one step further than simulators. In addition to modeling performance, they can interoperate with real systems. For example, MEMS devices can interact with existing system [4]. While disk arrays are widely used in the high-end storage systems, a lot of research work focuses on the modeling and simulation of disk arrays [5–7]. Among the work, the Pantheon storage system simulator [7, 8] was built to support the rapid exploration of design choices of storage systems in HP AutoRAID advanced disk array technology [9] and TicherTAIP parallel RAID architecture [10]. Uysal et al. developed a composite analytic model of mid-range disk array and reported its accuracy within 15% of actual measurements [6]. Compared with simulations and emulations, analytic models are much faster. However, they cannot capture as many details as simulators and emulators. Both methods rely heavily on human expertise on the targeted system and thus are called white-box approach. Given sufficient time and expertise, the white-box approach can work well in exploring design space for a particular device. Unfortunately, such time and expertise is not always available for high-end storage systems because the systems are usually complex and opaque. In addition, some information, such as patented use of algorithms and optimizations, is not disclosed. Furthermore, the technical trend towards storage consolidation in large data centers hints that building an accurate model or simulator using white box method cannot be a general solution in serving a variety of very different workloads. In contrast, the so-called black box approach treats the storage system as a black box without knowing the internal components or algorithms and can accommodate workloads of different characteristics. In this approach, the training data sets, which contain quantified description of characteristics of input I/O requests and their corresponding response times from the system, are recorded in a table [12] and fed into a statistic model [13] , or a machine learning model [14, 15]. Wang et al. [15] proposed to use classification and regression tree (CART) method as a black-box model for performance prediction as it is easy to fit for different workloads, has good interpretability, and provides good approximations to highly nonlinear mappings. However, the CART model has its critical drawback – it is not stable, in the sense that a small change of the data set can lead to a drastic change of the result (more details will be described in section 2.1). This paper addresses the issue by using ensemble algorithms to improve and enhance the accuracy and stability of the basic CART model. In this paper, we propose the selective bagging CART (SBCART) model, in which we modify the bagging algorithm for regression ensemble. In the model built on top of the CART model, we first train N models by bagging and then select n representative models from N models, where n < N . Compared with the CART model, SBCART can provide more precise and more stable performance predictions for modeled storage devices. In addition, an important measure missing in the feature vector designed by Wang [15] is about caching effect, which
Fig. 1. (a) Regression tree constructed with the Financial I/O trace containing 60 instances; and (b) regression tree of 56 instances, which is produced by removing four randomly selected instances from the tree in (a). The shaded nodes represent leaf nodes containing the predicted values. The Nmin is set to 25. More details on the Financial trace can be found in Section 3.
makes a substantial difference on prediction accuracy. We include the measure in the vector to make good predictions with the SBCART model. The remainder of this paper is organized as follows. Section 2 describes the SBCART model. Section 3 presents our experimental results, and Section 4 concludes the paper.
2 The SBCART Models
2.1 The CART Model
CART [16] is a nonparametric model which uses historical data to construct so-called decision trees. Trees are built top-down recursively beginning with a root node. At each step in the recursion, the CART algorithm determines which predictor variable and its value in the training data best split the current node into child nodes. The best split should minimize the difference among the instances in the child nodes. In other words, a good split produces child nodes with instances that contain similar values. Trees are grown to be excessively large with hundreds of levels if there is no pruning step. Furthermore, a tree of too large size may cause over-fitting, in the sense that it may perform poorly in predicting independent data. There are two pruning algorithms: optimization by minimum number and cross-validation. In the first pruning algorithm, splitting is stopped when the number of instances in the node is fewer than predefined required minimum Nmin . This approach is efficient, easy to apply and can produce consistent results. However, it requires the calibration of new parameter Nmin . In the second pruning algorithm, the procedure of cross validation is mainly based on the optimal proportion between
the complexity of the tree and the misprediction error. With the increase of tree size, the misprediction error decreases and reaches 0 when the tree grows into the maximum tree. Unfortunately, it usually generates complex decision trees that perform poorly on independent data. Therefore, a critical operation is to find the optimal proportion between the tree complexity and the misprediction error. Cross-validation does not require adjustment of any parameters; however, it is expensive to apply this pruning algorithm. Once the tree is built, an instance can traverse the pruned tree to make a prediction. At each tree node, either the left branch or the right branch is taken according to the outcome of comparing the instance with the split variable and value of the node. Finally, the instance reaches a leaf node whose value is the predicted value. However, CART may produce unstable decision trees. Insignificant modification of the learning instances, such as the elimination of a few instances or a change of split variables and values, can lead to radical changes in the decision tree. As Figure 1 shows, the decision tree constructed with 60 instances in Figure 1(a) is very different from the tree in Figure 1(b), where only four randomly selected instances were removed.
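As a concrete illustration of the first pruning strategy and of the instability described above, the Python sketch below fits regression trees with scikit-learn. The synthetic data and the use of min_samples_split as an approximation of the Nmin rule are assumptions for illustration only, not the paper's setup.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(60, 3))           # 60 instances, 3 predictor variables
y = 5.0 * X[:, 0] + rng.normal(0, 0.1, 60)    # synthetic response times

# Pruning by minimum node size: only split nodes that still hold enough instances.
tree_full = DecisionTreeRegressor(min_samples_split=25).fit(X, y)

# Instability: removing four instances can change the learned tree noticeably.
keep = np.ones(60, dtype=bool)
keep[rng.choice(60, 4, replace=False)] = False
tree_small = DecisionTreeRegressor(min_samples_split=25).fit(X[keep], y[keep])

print(tree_full.tree_.node_count, tree_small.tree_.node_count)  # tree sizes may differ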
2.2 Ensemble Learning
The goal of the ensemble learning method is to construct a collection (an ensemble) of individual models to improve the accuracy and performance of a single model. Many researchers have demonstrated significant performance improvements through ensemble methods [17, 18]. Figure 2(a) shows the basic steps involved in the training of a series of models with training data and in the using of voting strategy to predict new data samples. Two of popular techniques for constructing ensembles are bagging [19] and the adaboost family of algorithms [20]. Both methods invoke a base learning algorithm many times with different training sets. In bagging, a training set is derived by forming a bootstrap replica of the original training set, and each training record has the same weight. Compared with the bagging, the adaboost algorithm maintains a set of weights over the original training set and adjusts these weights after each model is trained with a base learning algorithm. The adjustments increase the weights of examples which are poorly predicted by the base learning algorithm and decrease the weights of examples which are well predicted. Bagging generates diverse classifiers or models only if the base learning algorithm is unstable, that is, small changes to the training set lead to significant changes of the learned classifier or model. Bagging can be viewed as an approach of improving prediction accuracy by exploiting the instability, because the composite model can efficiently reduce the variation of individual models. Adaboost requires less instability than bagging, because it can make much larger changes in the training set. As CART is instable, we propose to use the ensemble of bagging to improve the prediction accuracy of individual CART models.
Fig. 2. (a) Using ensemble learning to improve the accuracy of individual basic CART models; and (b) using selective ensemble learning to improve the accuracy of a basic model. In (a), bagging or adaboost can construct a series of models M1 , M2 , ..., Mk and then predict the unknown samples by using voting strategy; In (b), we first construct k models by bagging or adaboost, then select some representative models from k models, and use voting strategy to predict new samples.
2.3 The SBCART Method
Bagging is one of the widely used ensemble learning algorithms. Each training set is constructed by forming a bootstrap replica of the original training set. Thus, some samples in the original training set may appear many times in bootstrap data set while other samples may not appear. Prior research indicates that bagging can substantially improve the effectiveness of the unstable basic learning models [19]. However, as the number of ensemble models increases, the space and time cost will increase linearly. Many methods have been proposed to address the issue by using different classification methods [21–23]. Zhou et al. proposed a method to select a portion of a whole model tree and use genetic algorithms to prune the scale [21]. Bakker et al. proposed to cluster all models and then select representative models in each class to prune whole models [22]. Martinez-munoz et al. proposed a method to prune trees in the ordered bagging ensembles [23]. However, those pruning methods are pretty complicated. In contrast, our SBCART algorithm is proposed to solve the problem on regression and the pruning method is simple. We adopt CART as the basic model. First, we create k models by bagging. Second, we sort the k models by median relative error on the training set. Finally, we select the first 20%-50% of whole models to prune the scale. Figure 2(b) shows the basic steps involved in the training of a series of models on the training data, the selection of a part of whole models, and the using of voting strategy to predict new data samples based on the pruned models. Compared with figure 2(a), figure 2(b) adds a selective (pruning) function to prune the k models. The SBCART algorithm is described as follows using pseudo code.
Input: D: the dataset containing d samples; M: CART (the basic model); k: the number of models; s: the number of pruned models
Output: pruned models M*
Training Phase:
(1) for i = 1 to k do  // bagging
(2)   sample from D with replacement to yield Di (remove duplicated instances);
(3)   create the model Mi based on Di;
(4)   compute error(Mi) = (1/d) * Σ_{j=1..d} |Mi(Xj) − yj| / yj;  // relative error on the training set
(5) end for
Pruning Phase:
(1) order the k models by error(Mi) in ascending order;
(2) keep the first 20%–50% of the ordered models
Prediction Phase: use the pruned models to predict testing data X
(1) for i = 1 to s do
(2)   Wi = log((1 − error(Mi)) / error(Mi) + 1);  // assign a weight to each model
(3)   Vi = Mi(X);  // predicted values
(4) end for
(5) normalize the Wi;
(6) return Σ_{i=1..s} Wi * Vi
In the algorithm, we remove duplicated instances in Di while sampling with replacement in the training phase, because the duplicated instances in the training set can have a negative effect on choosing the best split variable and lead to significant changes of the structure of the tree. In our experiments, the size of the data set is reduced by half (5000 instances to 2500 instances) and the time of tree construction is reduced from 7 seconds to 1.25 seconds after deleting duplicated instances in Di. The weight function in the prediction phase is set to log((1 − error(Mi)) / error(Mi) + 1), which guarantees the weight to be positive. Furthermore, in order to get enough representative models, we choose a small pruning proportion (20%, for example) if k is large, and a large proportion (50%, for example) if k is small. Compared with unpruned bagging ensembles of CART models, SBCART has big advantages in both space cost and computation time because the number of models is reduced to about 20%–50% of the whole ensemble.
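To make the procedure concrete, here is a minimal Python sketch of the training, pruning, and prediction phases, using scikit-learn's DecisionTreeRegressor as the basic CART model. The class name, the error clamping, and the default parameter values are illustrative assumptions, not the authors' reference implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SBCART:
    def __init__(self, k=20, keep=0.5, min_samples_leaf=10, seed=0):
        self.k, self.keep = k, keep
        self.min_samples_leaf = min_samples_leaf
        self.rng = np.random.default_rng(seed)
        self.models, self.weights = [], []

    def fit(self, X, y):
        d = len(X)
        scored = []
        for _ in range(self.k):
            idx = np.unique(self.rng.integers(0, d, size=d))   # bootstrap replica, duplicates removed
            m = DecisionTreeRegressor(min_samples_leaf=self.min_samples_leaf)
            m.fit(X[idx], y[idx])
            err = np.mean(np.abs(m.predict(X) - y) / np.maximum(np.abs(y), 1e-9))
            scored.append((max(err, 1e-12), m))                # clamp to keep the weight finite
        scored.sort(key=lambda t: t[0])                        # order by training error, ascending
        kept = scored[: max(1, int(self.keep * self.k))]       # keep the best 20%-50%
        w = np.array([np.log((1 - e) / e + 1) for e, _ in kept])
        self.weights = w / w.sum()                             # normalize the weights
        self.models = [m for _, m in kept]
        return self

    def predict(self, X):
        preds = np.stack([m.predict(X) for m in self.models])
        return self.weights @ preds                            # weighted vote of the pruned models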
Fig. 3. (a) Training a SBCART model based on observed response times; and (b) using the model to predict response times
2.4 Predicting Performance with SBCART
Our goal is to build a model for a given storage device to predict device performance as a function of I/O workload. We use the UMass traces [24] which define a workload as a sequence of I/O requests. Each request Ri is characterized with five attributes: application specific unit (ASU), logical block address (LBA), size (SIZE), opcode (OPCODE), and timestamp (TIMESTAMP). The ASU is a positive integer representing the application specific unit; The LBA field is a positive integer that describes the ASU block offset of the requested data; The SIZE field is a positive integer that describes the number of requested bytes, where the size of a block is contained in the description of the trace file; The OPCODE field is a single, case insensitive character that defines the direction of the transfer, R or r indicates a read operation, W or w indicates a write operation; The TIMESTAMP field is a positive real number representing the offset in seconds for this I/O request from the beginning of the trace. Our approach uses SBCART to approximate the function. We assume that the model construction algorithm can take any workloads on a device for model training. Figure 3 shows the basic steps involved in the training of a model based on the observed response times and using the model to predict system response, which is per-request response time in this study. Model construction does not require any information about the internals of the modeled device. Therefore, the methodology is generally enough to model any device. We compared our SBCART model with CART model in Table 1. Various aspects are listed, including prediction error, stability, interpretability, robustness to outliers, ability to handle irrelevant input, model construction time, and prediction time. We list these aspects in the order of their importance to the storage performance prediction. Good stability indicates that a small change of the training data set cannot lead to significant change of the prediction results. Interpretability describes a model’s ability to infer the importance of input
Table 1. Comparison between SBCART and CART models when they are used to predict per-request response time
variables. Robustness describes a model’s ability to respond to noisy data sets. Irrelevant input refers to features that have little predictive value. We only compare SBCART with CART, and the comparison of other regression methods can be found in [15]. The two models are constructed using the first 5000 instances of Financial user4 trace and run on another 5000 instances of the same trace for testing (More details on the trace can be found in Section 3). The parameter k in SBCART is set to 20 and Nmin in CART is set to 10. As shown in Table 1, the prediction error (median relative error) of SBCART is lower and the stability is better compared to CART, as the composite models can reduce variance of individual models. The construction of SBCART model takes a longer time period and the space overhead of this model is higher, because SBCART needs to build k different models. However, the higher costs are well affordable in the systems for storage device performance prediction. Furthermore, the model construction time for SBCART can be reduced with parallel execution as each bootstrap modeling is independent. Overall, the SBCART method proposed for storage device performance prediction is more stable and more precise than CART.
3 Experiments
3.1 Request Feature Vector
Our request Feature Vector (FV) for Ri contains the following variables: Request Vector Ri = [TimeDiff_i(1), ..., TimeDiff_i(k), LBN_i, LBNDiff_i(1), ..., LBNDiff_i(m), Size_i, RW_i, Seq(i), Hit(i)], where TimeDiff_i(l) = TimeStamp_i − TimeStamp_{i−l} (l = 1, 2, ..., k) and LBNDiff_i(k) = LBN_i − LBN_{i−k} (k = 1, 2, ..., m). The first k variables measure the temporal burstiness of the workload when Ri arrives. The next m + 1 variables measure the spatial locality in terms of the distance between two consecutive requests. Seq(i) indicates whether the request is a sequential access; Size_i and RW_i are related to the data transfer time. Hit(i) indicates whether a request hits in the cache.
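A minimal sketch of how such a feature vector might be assembled from a request stream is shown below. The record layout, the approximate LRU stack used for Hit(i), the cache size, and the sequentiality test (which assumes sizes expressed in blocks) are assumptions for illustration, not the exact procedure used in the paper.

from collections import OrderedDict

K, M = 3, 5                 # history lengths for TimeDiff and LBNDiff
CACHE_BLOCKS = 4096         # size of the approximate LRU stack (assumed)

def build_fv(requests):
    """requests: list of (timestamp, lbn, size, opcode) tuples in arrival order."""
    lru = OrderedDict()
    fvs = []
    for i, (ts, lbn, size, op) in enumerate(requests):
        time_diff = [ts - requests[i - l][0] if i - l >= 0 else 0.0 for l in range(1, K + 1)]
        lbn_diff = [lbn - requests[i - k][1] if i - k >= 0 else 0 for k in range(1, M + 1)]
        seq = 1 if i > 0 and lbn == requests[i - 1][1] + requests[i - 1][2] else 0
        hit = 1 if lbn in lru else 0          # Hit(i): recency tracked by an approximate LRU stack
        lru[lbn] = None
        lru.move_to_end(lbn)
        if len(lru) > CACHE_BLOCKS:
            lru.popitem(last=False)           # evict the least recently used block
        rw = 1 if op.upper() == 'R' else 0
        fvs.append(time_diff + [lbn] + lbn_diff + [size, rw, seq, hit])
    return fvs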
3.2 Devices and Traces
We use DiskSim [3] to simulate a disk (Seagate ST32171W) of 7200 RPM. We replay all the traces on the device to obtain the training data set. We use the UMass traces [24], consisting of Financial traces and WebSearch traces. The Financial traces are from OLTP applications at two large financial institutions (relatively more sequential) and the WebSearch traces are from a popular search engine (relatively more random). There are several fields in the record of a request in a UMass trace file. The first field is the ASU, which is related to the application. In our experiments, we assume that one user runs one application on the server; therefore, the ASU number can be considered a user ID. We randomly chose two ASU numbers and filtered out all the requests for each of these ASUs, respectively. Accordingly, we obtained the WebSearch-user1 and WebSearch-user2 traces from WebSearch1.spc, and the Financial-user2 and Financial-user4 traces from Financial1.spc. We built our models based on those traces, as sketched below.
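The sketch below shows one way to filter the per-ASU request streams out of a UMass/SPC-style trace file. The comma-separated field order (ASU, LBA, SIZE, OPCODE, TIMESTAMP) follows the description above, and the file name and ASU number in the usage line are placeholders.

def load_user_trace(path, asu):
    """Return (timestamp, lba, size, opcode) records for one ASU (treated as a user ID)."""
    records = []
    with open(path) as f:
        for line in f:
            fields = line.strip().split(',')
            if len(fields) < 5 or int(fields[0]) != asu:
                continue
            records.append((float(fields[4]), int(fields[1]), int(fields[2]), fields[3]))
    records.sort(key=lambda r: r[0])   # order by timestamp
    return records

# e.g. financial_user2 = load_user_trace('Financial1.spc', asu=2)   # hypothetical ASU number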
3.3 Evaluation Methods
For evaluation, we use the trained device models to predict the response time of a single request. We define the relative prediction error as |Ŷ − Y| / Y to show the accuracy of the different modeling algorithms. We also show the average, 90th, 80th, and 70th percentile relative errors of response time for the different data sets. Based on the above four users' traces, we trained four models: Model_Fin2, Model_Fin4, Model_Web1, and Model_Web2, respectively. One hundred thousand requests were obtained for each user from the original trace; half of the requests are used for training and half for testing. In our experiments, k in TimeDiff_i(k) is set to 3 and m in LBNDiff_i(m) is set to 5. The k in SBCART is set to 20, the pruning proportion is set to 50%, and the Nmin of CART is set to 10.
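The error statistics used in this evaluation (the median and the 70th/80th/90th percentile relative errors) can be computed directly; the helper below is a small illustration consistent with the definition |Ŷ − Y| / Y, not code from the paper.

import numpy as np

def relative_errors(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.abs(y_pred - y_true) / np.abs(y_true)

def report(y_true, y_pred):
    e = relative_errors(y_true, y_pred)
    return {'median': np.median(e),
            'mean': e.mean(),
            'p70': np.percentile(e, 70),
            'p80': np.percentile(e, 80),
            'p90': np.percentile(e, 90)}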
3.4 Experiment Results
Figure 4 compares the median relative errors of the two models (SBCART and CART) in modeling the Seagate ST32171W disk on the Financial (Model_Fin2, Model_Fin4) and WebSearch (Model_Web1, Model_Web2) traces, respectively. Overall, the SBCART-based device models provide better prediction accuracy in predicting the average, 90th, 80th, and 70th percentile response times than CART. We can make several observations from the experiment results. First, the feature vector must be designed to include all relevant measures. An important measure missing in the feature vector designed by Wang et al. [15] is the caching effect, which makes a substantial difference in prediction accuracy. As hitting in the buffer cache is basically determined by the temporal locality of accessed blocks [25], we propose to maintain an approximate LRU stack to efficiently track the recency of requested blocks and use it as a measure in the vector. As shown in Figure 4(a), CART-cache can reduce the error from 25.12%, 25.04%,
Fig. 4. Comparison of SBCART and CART on four traces: M odelF in2 , M odelF in4 , M odelW eb1 , M odelW eb2 . CART-nocache shows that cache information (Hit) is not considered in feature vector and CART-cache shows that the cache information is used as a measure in feature vector.
91.64%, and 23.18% to 15.01%, 14.15%, 15.99%, and 13.48% on Model_Web1, Model_Web2, Model_Fin2, and Model_Fin4, respectively. We can see that the median relative error is reduced by about 10% on Model_Web1, Model_Web2, and Model_Fin4, and by about 75% on Model_Fin2. We also observed that the traces of Model_Web1, Model_Web2, and Model_Fin4 are relatively more random and the trace of Model_Fin2 is relatively more sequential. Therefore, sequential workloads like Financial-user2 are more sensitive to the caching effect, and the addition of the cache information can greatly reduce the prediction error. Second, SBCART can improve the accuracy and stability of CART. As shown in Figure 4(a), SBCART-nocache can improve prediction accuracy by about 5% compared to CART-nocache, and SBCART-cache can improve the accuracy by about 3% compared to CART-cache. We can see that by using the measure of Hit in the feature vector and the ensemble method, the prediction accuracy can be increased by about 13% for the relatively random workloads and about 70% for the relatively sequential workloads. As shown in Figure 5, SBCART is more stable than CART when the training data set changes, because selective ensemble models can reduce the variance of individual models.
Fig. 5. Comparison of stability between CART and SBCART. X-axis shows the number of training records missing from the first 5000 instances of WebSearch-user1 trace and its testing data from another 5000 instances from the same trace.
Finally, it is more difficult to predict response times at high percentiles. As shown in Figures 4(b), (c) and (d), the median relative errors are reduced by about 5%, 8% and 11%, respectively, compared to Figure 4(a). We can observe that SBCART can consistently produce more precise predictions than CART. In summary, the SBCART model as well as the workload characterization (feature vector) used in the modeling can produce more accurate predictions and is more stable than the CART model.
4 Conclusions
Storage device performance modeling is an important component in self-managed storage systems, especially in high-end storage systems. Our SBCART model takes a workload as input and predicts its performance on the modeled device efficiently and accurately compared to the CART model. Based on bagging algorithms, we proposed a selective bagging classification and regression tree (SBCART) model using the basic model CART model. Our experiment results show that the SBCART model as well as the workload characterization (feature vector) used in the modeling can produce more accurate predictions and is more stable than the CART model. Acknowledgements. The authors gratefully acknowledge the support of the Fundamental Research Funds for the Central Universities, the National Natural Science Foundation of China (No.60833004 and No. 60775037), the National High Technology Research and Development Program of China (863 Program,No.2009 AA01Z123), and Specialized Research Fund for the Doctoral Program of Higher Education (No.20093402110017). This research was also partially supported by US National Science Foundation under grant CAREER CCF 0845711.
References 1. Ruemmler, C., Wilkes, J.: An introduction to disk drive modeling. IEEE Computer 27(3), 17–18 (1994) 2. Worthington, B., Ganger, G., Patt, Y.: Scheduling algorithms for modern disk drives. In: Proc. of the ACM SIGMETRICS Conference, vol. 22, pp. 241–251. ACM, New York (1994) 3. The DiskSim Simulation Environment (v3.0), Parallel Data Lab, http://www.pdl.cmu.edu/DiskSim/ 4. Griffin, J.L., Schindler, J., Schlosser, S.W., Bucy, J.S., Ganger, G.R.: Timingaccurate storage emulation. In: FAST 2002 on File and Storage Technologies, pp. 75–88. USENIX Assoc., Monterey (2002) 5. Barve, R., Shriver, R., Gibbons, P.B., Hillyer, B.K., Matias, B.K., Vitter, J.S.: Modeling and optimizing i/o throughput of multiple disks on a bus. In: ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pp. 83–92. ACM, New York (1999) 6. Uysal, M., Alvarez, M., Merchant, A.: A modular, analytical throughput model for modern disk arrays. In: 9th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems Conference, MASCOTS, Cinncinnati, pp. 183–192 (2001) 7. Wilkes, J.: The Pantheon storage-system simulator. Technical Report HPL-SSP95-14, Storage Systems Program, Hewlett-Packard Laboratories (1996) 8. Aicheler, U.: A visual user interface for the pantheon storage system simulator. Technical Report HPLSSP961, Storage Systems Program, Hewlett-Packard Laboratories (1996) 9. Wilkes, J., Golding, R., Staelin, C., Sullivan, T.: The HP AutoRAID hierarchical storage system. ACM Transactions on Computer Systems 14(1), 108–136 (1996) 10. Cao, P., Lim, S.B., Venkataraman, S., Wilkes, J.: The TickerTAIP parallel RAID architecture. ACM Transactions on Computer Systems 12(3), 236–269 (1994) 11. Schindler, J., Ganger, G.R.: Automated disk drive characterization. CMU SCS Technical Report CMU-CS-99-176 (1999) 12. Andenson, E.: Simple table-based modeling of storage devices. Technical Report HPL-SSP-2001-04, HP Laboratories (2001) 13. Kelly, T., Cohen, I., Goldszmidt, M., Keeton, K.: Inducing models of black-box storage arrays. Technical Report HPL-SSP-2004-108, HP Laboratories (2004) 14. Mesnier, M.P., Wachs, M., Sambasivan, R.R., Zheng, A.X., Ganger, G.R.: Modeling the relative fitness of storage. In: Joint International Conference on Measurement and Modeling of Computer Systems. ACM, New York (2007) 15. Wang, M., Au, K., Ailamaki, A., Brockwell, A., Faloutsos, C., Ganger, G.R.: Storage device performance prediction with cart models. In: 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS, USA (2004) 16. Breiman, L., Friedman, J., Stone, C.J., A., O.R.: Classification and regression trees. Chapman and Hall CRC, Boca Raton (1984) 17. Kohavi, R., Kunz, C.: Option decision trees with majority votes. In: 14th International Conference on Machine Learning, Morgan Kaufman, San Francisco (1997) 18. Bauer, E., Kohavi, R.: An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning 36(1-2), 105–139 (1999) 19. Breiman, L.: Bagging predictors. Machine learning 24(1), 123–140 (1996)
20. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: 13th International Conference on Machine Learning. Morgan Kaufmann, San Francisco (1996)
21. Zhou, Z.H., Tang, W.: Ensembling neural networks: Many could be better than all. Artificial Intelligence 137(1-2), 239–263 (2003)
22. Bakker, B., Heskes, T.: Clustering ensembles of neural networks. Neural Networks 16(2), 261–269 (2003)
23. Martínez-Muñoz, G., Suárez, A.: Pruning in ordered bagging ensembles. In: 23rd International Conference on Machine Learning, pp. 1266–1273. IEEE, Piscataway (2006)
24. UMass trace repository, http://traces.cs.umass.edu/index.php/Storage/Storage
25. Jiang, S., Zhang, X.: LIRS: an efficient low inter-reference recency set replacement to improve buffer cache performance. In: ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pp. 31–42. ACM, New York (2002)
Embedding Algorithms for Bubble-Sort, Macro-star, and Transposition Graphs HyeongOk Lee1, Hyun Sim1, JungHyun Seo2, and Mihye Kim3,* 1
Department of Computer Education & Engineering, Sunchon National University, 413 Jungangno Suncheon Chonnam, South Korea {oklee,simhyun}@scnu.ac.kr 2 Research & Development Team, Jeonnam Technopark, 315 Maegokdong Suncheon Chonnam, South Korea [email protected] 3 Department of Computer Science Education, Catholic University of Daegu, 330 Hayangeup Gyeonsansi Gyeongbuk, South Korea [email protected]
Abstract. Bubble-sort, macro-star, and transposition graphs are interconnection networks with the advantages of star graphs in terms of improving the network cost of a hypercube. These graphs contain a star graph as their sub-graph, and have node symmetry, maximum fault tolerance, and recursive partition properties. This study proposes embedding methods for these graphs based on graph definitions, and shows that a bubble-sort graph Bn can be embedded in a transposition graph Tn with dilation 1 and expansion 1. In contrast, a macro-star graph MS(2, n) can be embedded in a transposition graph with dilation n, but with an average dilation of 2 or under. Keywords: Interconnection network, Embedding, Dilation.
1 Introduction Applications in engineering and scientific fields such as artificial intelligence, CAD/CAM (Computer-Aided Design and Computer-Aided Manufacturing), and fluid mechanics require hundreds of operations for data processing, which has led to increased interest in high-performance computers with a large number of processors. Consequently, interest in parallel processing, in which more than one processor simultaneously executes multiple tasks or a part of one program, has risen dramatically. However, it has proven challenging to design effective parallel algorithms. Parallel algorithms are usually designed for a particular parallel computer architecture, so to optimize algorithm design it is necessary to understand parallel computer architectures. There are a number of major architectures, and several methods exist for classifying them. One of the most well-known methods is Flynn’s taxonomy, which categorizes architectures into four groups based on the number of instruction and data streams available in the architecture. Of the four types, MIMD (Multiple Instruction, *
Multiple Data streams) is the most common and it can simultaneously execute multiple instruction and data streams within one system. MIMD parallel computers can be classified into two types based on memory architecture: multi-processor systems with shared memory and multi-computer systems with distributed memory. In a multi-computer system, each processor has its own memory and is connected to other processors via an interconnection network. Inter-processor communication is achieved by sending messages among computers through the network, and a data-driven system is used for computations [1], [2]. The performance of the multi-computer system depends on the performance of each processor as well as the structure of the interconnection network and the applied algorithms. The characteristics of the interconnection network greatly influence the overall performance and scalability of a multi-computer system. The most well-known topologies of interconnection networks are the mesh, hypercube, and star graph. The commonly used parameters for evaluating the performance of interconnection networks are degree, diameter, symmetry, scalability, fault tolerance, and embedding. The embedding of interconnection networks is intended to analyze the interrelationship between graphs to observe whether a certain graph G is included in or interrelated with another graph H. The evaluation of embedding is significant: if graph G can be efficiently embedded in graph H with less cost, then the methods developed for the interconnection network with graph G can be used in the interconnection network with graph H at less cost [3], [4]. In previous works [5], [6], we analyzed embedding methods for some star variations such as the star, matrix-star, Rotator-Faber-Moore, and pancake graphs. In this paper we analyze embedding methods for bubble-sort, macro-star, and transposition graphs, which are well-known variations of the star graph. This work extends earlier work to develop a method for embedding between bubble-sort and transposition graphs [7].
2 Related Work An interconnection network can be represented as an undirected graph G = (V, E), with each processor represented as a node (vertex) v of G, and the communication channel between processors represented as an edge (v, w). V(G) and E(G) represent the set of nodes and edges of graph G, respectively. That is, V(G) = {0, 1, 2, …, n–1} and E(G) consists of pairs of distinct nodes from V(G). There exists an edge (v, w) between two nodes v and w of G if and only if a communication channel between v and w exists [8]. If we classify the interconnection networks proposed to date, we can divide them into the mesh variation with n×k nodes [9], the hypercube variation with 2^n nodes [2], [3], [10], the star graph variation with n! nodes [8], and the odd graph variation with C(2n, n) nodes [4]. A variation of the star graph represents nodes using n distinct symbols, and the number of nodes is approximately n!. Star [8], [11], bubble-sort [12], pancake [11], transposition [13], macro-star [14], rotator [15], and Faber-Moore [16] graphs have been proposed as variations of the star graph. These graphs have a smaller node degree and diameter than a hypercube with a similar number of nodes.
A macro-star graph MS [14] is an interconnection network that improves the network cost of a star graph by generalizing the star graph. The size and degree of MS are determined by parameters l and n. A macro-star graph MS(l, n) has (nl+1)! nodes, degree (n+l-1), and (nl+1)! * (n+l-1) edges. The address of each node is represented as a permutation of k (= nl+1) distinct symbols; in other words, a node corresponds to a permutation. An edge exists between nodes u and v in MS(l, n) if and only if the permutation of node v can be obtained from that of node u by applying one of the two edge generators Tj and Si defined below, where 2 ≤ j ≤ n+1 and 2 ≤ i ≤ l. A macro-star graph MS(l, n) can be defined as shown in Eq. (1), where the symbol set is {1, 2, .., k} and U = u1:k = u1u2...ui...uk is a permutation of it, ui ∈ {1, 2, .., k}.
V(MS(l,n)) = {U = u1:k │ ui, uj ∈ {1, 2, .., k}, ui ≠ uj, i ≠ j, 1 ≤ i, j ≤ k}, E(MS(l,n)) = {(U, V) │ U, V ∈ V(MS(l, n)) satisfying U = Tj(V) or U = Si(V), 2 ≤ j ≤ n+1, 2 ≤ i ≤ l}. (1)
Two edge generators Tj and Sn,i are defined in the macro-star graph MS(l, n) to formulate a link (edge) from one node to another. The edge generator Tj creates a permutation by interchanging the first symbol u1 with the jth symbol uj of a given node. For a given node U = u1:k = u1u2...ui...uk, the permutation generated by Tj is Tj(U) = uj u2:j-1 u1 uj+1:k. The other edge generator Sn,i creates a permutation by interchanging the sequence of symbols u(i-1)n+2:in+1 with the sequence of symbols u2:n+1 in a given node. For a given node U = u1:k, the permutation generated by edge Sn,i is Sn,i(u1:k) = u1 u(i-1)n+2:in+1 un+2:(i-1)n+1 u2:n+1 uin+2:k. The edge generator Sn,i is written simply as Si. Here, the symbol sequence u(i-1)n+2:in+1 is referred to as a cluster [14]. Fig. 1a shows a top view of an MS(2, 2) graph, while Fig. 1b presents the details of the level-2 cluster ‘23’. Each circle corresponds to a cluster. The smaller circles in the internal domain of the inclusive circle in Fig. 1b are nodes whose second clusters consist of ‘23’; that is, the full permutation of node 145 is 14523, that of 541 is 54123, and so on. The permutation generated by the edge generator T2 from node 14523 is 41523; that is, T2(14523) = 41523 and T3(14523) = 54123. When the edge generators Tj and Si are applied sequentially to the permutation of a certain node U, they are written as Si(Tj(U)), or simply SiTj(U). For example, the sequence of generators S2(T2(14523)) creates the permutation 42315 from 14523: first T2 generates 41523, and then S2 provides 42315 [5].
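The two generators translate directly into string operations on the node label. The short Python sketch below (1-based symbol positions mapped to Python slices) reproduces the worked example T2(14523) = 41523 and S2(41523) = 42315; the function names are illustrative only.

def T(j, u):
    """Swap the first symbol with the j-th symbol (1-based), j in 2..n+1."""
    u = list(u)
    u[0], u[j - 1] = u[j - 1], u[0]
    return ''.join(u)

def S(i, u, n):
    """Exchange the cluster u_{(i-1)n+2 : in+1} with u_{2 : n+1}, i in 2..l."""
    a, b = (i - 1) * n + 1, i * n + 1            # 0-based slice bounds of the i-th cluster
    return u[0] + u[a:b] + u[n + 1:a] + u[1:n + 1] + u[b:]

assert T(2, '14523') == '41523'
assert T(3, '14523') == '54123'
assert S(2, T(2, '14523'), n=2) == '42315'       # S2(T2(14523)) from the example above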
[Figure: (a) top view of MS(2, 2), with one circle per level-2 cluster (12, 13, 14, 15, 21, 23, ..., 54); (b) detail of the level-2 cluster '23', showing the nodes 145, 154, 415, 451, 514, and 541 joined by T2 and S2 edges.]
Fig. 1. Example of a macro-star graph MS(2, 2)
An n-dimensional bubble-sort graph Bn [12] consists of n! nodes and n!(n–1)/2 edges. The address of each node is represented as a permutation of the n symbols {1, 2, 3, ..., n}. An edge exists between two arbitrary nodes v and w if and only if the permutation of node w can be obtained from that of v by interchanging two adjacent symbols in the permutation of v. The bubble-sort graph Bn can thus be defined as shown in Eq. (2), where the symbol set is {1, 2, .., n} and B = b1b2...bn is a permutation of it, bi ∈ {1, 2, .., n}. As the number of dimensional edges adjacent to B is equal to n–1, the bubble-sort graph Bn is a regular graph of degree n–1 and has a diameter of n(n–1)/2. It is also a hierarchical interconnection network because it can be partitioned recursively along a chosen dimension. It is node- and edge-symmetric as well as bipartite and includes Hamiltonian cycles.
An n-dimensional transposition graph Tn [13] consists of n! nodes and n(n-1)n!/4 edges. The address of each node is represented as a permutation of n distinct symbols, and an edge exists between two nodes v and w if and only if the permutation of node w can be obtained from that of v by interchanging the positions of any two arbitrary symbols from {1, 2, .., n} in v. A transposition graph Tn can be defined by Eq. (3), where the symbol set is {1, 2, .., n} and P = p1p2...pn is a permutation of it, pi ∈ {1, 2, .., n}. The transposition graph Tn is a regular, node-symmetric graph of degree n(n-1)/2, because an edge exists between a permutation of the n symbols and every permutation in which two arbitrary different symbols are interchanged. It has maximum fault tolerance, with a diameter of n–1 and a fault diameter of n. It also includes Hamiltonian cycles.
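Since every bubble-sort edge (an adjacent transposition) is also a transposition edge, the identity mapping embeds Bn into Tn with dilation 1. The Python check below verifies this for n = 4 under straightforward adjacency tests written for this illustration (not code from the paper).

from itertools import permutations, combinations

def bubble_adjacent(p, q):
    # adjacent in Bn: q is obtained from p by swapping two neighbouring symbols
    return any(p[:i] + (p[i + 1], p[i]) + p[i + 2:] == q for i in range(len(p) - 1))

def transposition_adjacent(p, q):
    # adjacent in Tn: q is obtained from p by swapping any two symbols
    diff = [i for i in range(len(p)) if p[i] != q[i]]
    return len(diff) == 2 and p[diff[0]] == q[diff[1]] and p[diff[1]] == q[diff[0]]

n = 4
nodes = list(permutations(range(1, n + 1)))
b_edges = [(p, q) for p, q in combinations(nodes, 2) if bubble_adjacent(p, q)]
assert all(transposition_adjacent(p, q) for p, q in b_edges)   # every Bn edge is a Tn edge
print(len(nodes), len(b_edges))    # 24 nodes and 36 edges = n!(n-1)/2 for n = 4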
3 Embedding Analysis The embedding of one graph G into another graph H is a mapping mechanism for examining whether graph G is included in the structure of graph H, and how they are interrelated. This can be interpreted as simulating one interconnection topology using another. The embedding of graph G into a graph H is defined as a function f = (ø, ρ) where ø maps the set of vertices in G, V(G) one-to-one into the set of vertices in H, V(H), and ρ corresponds to each edge (v, w) in G to a path in H that connects nodes ø(v) and ø(w). Parameters for evaluating the efficiency of an embedding method include dilation, congestion, and expansion. The dilation of edge e in G is the length of the path ρ(e) in H, and the dilation of embedding f is the maximum value of all dilations in G. The congestion of edge e' in H is the number of ρ(e) included in e', and the congestion of embedding f is the maximum number of all edge congestions in H. The expansion of embedding f is the ratio of the number of vertices in H to the number in G [6].
138
H. Lee et al. 1
Embedding edge e (3, 6) into G2
2
with dilation 2
3 e
4
5
6
Graph G1
7
1
2
5
6
3
4
8
7
congestion 2
Graph G2
Fig. 2. Mapping example of G1 into G2
For instance, let each node (vertex) in the set V(G1) be mapped onto the corresponding same node number in the set V(G2) in Fig. 4. Then, edge e (3, 6) in G1 can be mapped onto edges (3, 1) and (1, 6) or edges (3, 7) and (7, 6) in G2 (i.e., to the path from 3 to 6 in G2). Let us assume that edge e (3, 6) in G1 is mapped onto edges (3, 1) and (1, 6) in G2. Here, the dilation of this embedding is 2 because the length of the path ρ(e) in G2 is 2. As well, we can see that the congestion is 2 because edge e (1, 3) in G2 is routed by two edges (1, 3) and (3, 6) in G1, and the expansion is 8/7 [6]. When we assume that R(=r1r2...ri...rj...rn) is a node in an arbitrary graph and V is adjacent from R via dimensional edge J in the graph, it is represented as V = J(R). And when we assume that node V is reached from node R by applying dimensional edges J and K in sequence, we denote the edge sequence as <J, K> and V = K(J(R)). Sequentially applying edge sequence <J, K> to node R means that at the first time unit, the permutation of node J(R) is generated from node R via dimensional edge J, and at the second time unit, the permutation of K(J(R)) is created from node J(R) via dimensional edge K (i.e., V = K(J(R))). The basic principles of embedding applied in this study are as follows. Node mapping to bubble-sort, transposition, and macro-star graphs is based on one-to-one mapping with identical node numbers. When mapping two adjacent nodes (U, V) of a source graph to a target graph, the dimensional edge sequence is defined using the edge definition of the target graph. This sequence is formulated with dimensional edges of the target graph used for the shortest path from ø(U) to ø(V). The dilation of embedding is represented as the number of dimensional edges required for the shortest path. Theorem 1. A bubble-sort graph Bn can be embedded into a transposition graph Tn with dilation 1 and expansion 1. Proof. In the bubble-sort graph Bn, if the permutation of a node B is b1b2b3...bi1bibi+1...bn, then the permutation of node B' adjacent from node B is b1b2...bi+1bibi-1...bn. The edge that connects nodes B and B' is called the i-dimensional edge (2 ≤ i ≤ n). There exist (n-1) edges of dimension i in Bn. When mapping nodes B and B'(=b1b2...bi+1bibi-1...bn) in Bn onto nodes T(=t1t2...titi+1...tn) and T'(=t1t2...ti+1ti...tn), respectively, in Tn, we analyze the dilation of this mapping by referring to the length of the shortest path routing from node T to node T' in Tn. In the transposition graph Tn, there exists an edge between nodes V and W if the corresponding permutation to the node W can be obtained from that of V by interchanging the positions of any two arbitrary symbols from {1, 2, .., n} in V. Here, we can see that the nodes T(t1t2...titi+1...tn) and T'(t1t2...ti+1ti...tn) in Tn are adjacent to each other by the edge
Embedding Algorithms for Bubble-Sort, Macro-star, and Transposition Graphs
139
definition of transposition graph Tn. Therefore, a bubble-sort graph Bn can be embedded into a transposition graph Tn with dilation 1 and expansion 1. Corollary 2. A bubble-sort graph Bn is a sub-graph of a transposition graph Tn. Theorem 3. The dilation cost of embedding a transposition graph Tn into a bubblesort graph Bn is O(n). Proof. The transposition graph Tn and bubble-sort graph Bn are both node-symmetric. When mapping the transposition graph Tn and the bubble-sort graph Bn, node T(=t1t2t3...ti...tj...tn) in Tn maps onto node B(=b1b2b3...bi...bj...bn) in Bn, and each node T' of n(n–1)/2 nodes, which are adjacent to node T via edge T(i, j), maps onto B', whose address is the same as that of each corresponding node T'(i < j). We prove Theorem 3 using the mapping case of two nodes T(=t1t2t3...ti...tj...tn) and T'(=tnt2t3...ti...tj...t1), adjacent to each other via edge T(1, n), onto a bubble-sort graph. The node T', which is adjacent to node T via edge T(1, n), is the permutation tnt2t3...ti...tj...t1 in which the first symbol and the nth symbol are interchanged with each other from the permutation of node T. Let us map nodes T(=t1t2t3...ti...tj...tn) and T'(=tnt2t3...ti...tj...t1) in Tn onto nodes B(=b1b2b3...bi...bj...bn) and B'(=bnb2b3...bi...bj...b1) in Bn. Here, nodes T and T' are adjacent to each other, but nodes B and B' in Bn are not. Thus, we analyze dilation based on the length of the shortest path routing from B and B' in Bn. In the bubble-sort graph Bn, the dimensional edge sequence required for the shortest path routing from node B to node B' is <1,2,3,...,n-1,n-2,n-3,...,3,2,1>. First, the first symbol b1 of node B can be moved to the last position (nth position) using the dimensional edge sequence <1,2,3,...,n-1>, since only an edge exists between two nodes in which two adjacent symbols are interchanged with each other. That is, by sequentially applying the dimensional edge sequence <1,2,3,...,n-1> to node B(=b1b2b3...bi...bj...bn), we can obtain the permutation b2b3...bi...bj...bnb1. Next, the permutation bnb2b3...bi...bj...bn-1b1 (i.e., the permutation of destination node B') is obtained by orderly applying the edge sequence to the node of the permutation b2b3...bi...bj...bnb1. Here, the number of dimensional edges applied for routing from B to B' in Bn is equal to 2n-3. Therefore, we can say that the dilation cost for this embedding process is O(n). Theorem 4. A macro-star graph MS(2, n) can be embedded into a transposition graph T2n+1 with dilation n. Proof. We prove Theorem 4 by dividing it into two edges, Ti and Sj, which connect two arbitrary nodes in the macro-star graph MS(2, n). Case 1. Edge Ti, 2 ≤ i ≤ n+1 In the macro-star graph MS(2, n), the node adjacent to node U(=u1u2u3...ui...un+i...u2n+1) via edge Ti is Ti(U)(=uiu2u3...u1......un+i...u2n+1) and it is denoted as U'. When we map nodes U(=u1u2u3...ui......un+i...u2n+1) and U'(=uiu2u3...u1......un+i...u2n+1) in MS(2, n) onto nodes T(=t1t2t3...ti...tn+i...t2n+1) and T'(=tit2t3...t1...tn+i...t2n+1) in T2n+1, we can see that the nodes T and T' in T2n+1 are adjacent to each other through edge T(1, i) according to the edge definition of the transposition graph. Hence, a macro-star graph MS(2, n) can be embedded into a transposition graph T2n+1 with dilation 1.
140
H. Lee et al.
Case 2. Edge Sj, j = 2 Edge Sj, which is incident on node U(=u1u2u3...ui......un+i...u2n+1), can exist only where j = 2, because the macro-star graph MS(2, n) consists of two modules. In MS(2, n), the node adjacent from U(u1u2u3...ui......un+i...u2n+1) via edge S2 is S2(U)(=u1un+1un+2un+3...un+i...u2n+1u2u3u4...ui...un+1), and it is denoted as U'. When mapping nodes U and U'(=u1un+1un+2un+3...un+i...u2n+1u2u3u4...ui...un+1) in MS(2, n) onto nodes T(=t1t2t3...ti...tn+i...t2n+1) and T'(=t1tn+1tn+2tn+3...tn+i...t2n+1t2t3t4...ti...tn+1) in T2n+1, it can be seen that the nodes T and T' in T2n+1 are not adjacent to each other by the edge definition of the transposition graph. Thus, we analyze the dilation of this mapping using the number of edges used for the shortest path routing from node T to node T' in T2n+1. The dimensional edge sequence required for routing from node T(t1t2t3...ti...tn+i...t2n+1) to node T'(tit2t3...t1...tn+i...t2n+1) in T2n+1 is . In other words, edge T(i, n+i), which interchanges the symbols ti and tn+i, which are in the same position of the first and second modules in the permutation of node T(t1t2t3...ti...tn+i...t2n+1), is used n times, so the permutation identical to node T'(tit2t3...t1...tn+i...t2n+1) is generated. Therefore, its dilation is n. When we map a macro-star graph MS(2, n) onto a transposition graph T2n+1, the worst dilation of this embedding is n, but most edges are mapped by dilation 1. Accordingly, it may be advisable to prove that the average dilation reaches to the smallest constant. Theorem 5. A transposition graph T2n+1 can be embedded into a macro-star graph MS(2, n) with dilation 5. Proof. In this embedding, we map node T(t1t2t3...ti...tj...t2n+1) in the transposition graph T2n+1 onto node U(u1u2u3...ui...uj...u2n+1) in the macro-star graph MS(2, n), and node T' onto node U', which has the same permutation with T' among the nodes in MS(2,n). The permutation of the node adjacent to node T(t1t2t3...ti...tj...t2n+1) via edge T(i, j) in T2n+1 is T'=t1t2t3...tj...ti...t2n+1. Here, nodes U(u1u2u3...ui...un+i...u2n+1) and U'(u1u2u3...uj...ui...u2n+1) in graph MS(2, n) are not adjacent to each other, thus we analyze dilation using the number of edges used for the shortest path routing from node U to node U' in MS(2, n). We prove Theorem 6 by dividing it into three cases depending on the values of i and j in T(i, j). Case 1. i, j ≤ n–1, i < j In the edge T(i, j), which connects nodes T(t1t2t3...ti...tj...t2n+1) and T' in the transposition graph T2n+1, values of i and j smaller than (n+2) mean that the two symbols can be interchanged only from the first symbol to (n+1)th symbols. The occurrence of the interchange only from the first symbol to the (n+1)th symbol in node U(u1u2u3... ui... uj... u2n+1) of MS(2,n), in which node T(t1t2t3... ti... tj... t2n+1) of T2n+1 is mapped, means that a symbol interchange occurs among the symbols that consist of the first cluster of node U. Therefore, the edge sequence required for the shortest path routing from node U to node U' is <Ti, Tj, Ti>, because the permutation of node U' is u1u2u3...uj...ui...un+2un+3...u2n+1, and nodes U and U' are not adjacent to each other. The routing process from U to U' using this edge sequence <Ti, Tj, Ti> is as follows. First, node Ti(U)(=uiu2u3...u1...uj...un+2un+3...u2n+1) is reached from node U(u1u2u3...ui...uj...un+2un+3...u2n+1) via edge Ti, which interchanges the first symbol with the ith symbol ui in node U; that is, the node Ti(U) is adjacent to node U through edge Ti. 
We then get to node TjTi(U)(=uju2u3...u1...ui...un+2un+3...u2n+1) from node Ti(U)
Embedding Algorithms for Bubble-Sort, Macro-star, and Transposition Graphs
141
via edge Tj, which interchanges the ith and jth symbols in node Ti(U). Next, we reach node TiTjTi(U)(=u1u2u3...uj...ui...un+2un+3...u2n+1) from TjTi(U) via edge Ti, which interchanges symbols uj and u1 in node TjTi(U). Now, we can see that the permutation of node TiTjTi(U), which is obtained by sequentially applying the edge sequence <Ti, Tj, Ti> to node U, is the same as the permutation of node U'. Therefore, two nodes T(t1t2t3...ti...tj...t2n+1) and T' adjacent via edge T(i, j) in the transposition graph T2n+1 can be embedded into a macro-star graph MS(2, n) with dilation 3. Case 2. i ≤ n+1, j ≥ n+2 In the edge T(i, j) which connects nodes T(t1t2t3...ti...tj...t2n+1) and T' in T2n+1, i ≤ n+1 and j ≥ n+2 mean that two symbols are interchanged with each other, and of these two, the one based on the (n+1)th position is positioned before it, and the other locates next to it. The occurrence of the interchange based on the (n+1)th position of a symbol in node U(u1u2u3...ui...uj...u2n+1) of MS(2,n), in which node T(t1t2t3...ti...tj...t2n+1) of T2n+1 is mapped, means that an interchange occurs between one symbol of the first cluster and one symbol of the second cluster in node U. Since the permutation of node U' is u1u2u3...ui...un+2un+3...uj...u2n+1, and nodes U and U' are not adjacent to each other, the edge sequence required for the shortest path routing from node U to node U' is <Ti, S2, Tj, S2, Ti>. The routing process from U(u1u2u3...ui...un+1un+2un+3...uj...u2n+1) to U'(u1u2u3...ui...un+1un+2un+3...uj...u2n+1) using this edge sequence <Ti, S2, Tj, S2, Ti> is as follows. First, node Ti(U)(=uiu2u3...u1...un+1un+2un+3...uj...u2n+1) is adjacent to node U(u1u2u3...ui...un+1un+2un+3...uj...u2n+1) via edge Ti, which interchanges the first with the ith symbol ui in node U; then node S2Ti(U)(=uiun+2un+3...uj...u2n+1u2u3...u1...un+1) is adjacent to node Ti(U) through edge S2, which swaps the first and the jth cluster uj in Ti(U). Next, node TjS2Ti(U)(=ujun+2un+3...ui...u2n+1u2u3...u1...un+1) is adjacent to node S2Ti(U) through edge Tj, which interchanges the first symbol ui with the symbol uj in the first cluster in S2Ti(U). After that, node S2TjS2Ti(U) (=uju2u3...u1...un+1un+2un+3...ui...u2n+1) is reached from TjS2Ti(U) via edge S2, which exchanges the first cluster in which the symbol ui exists with the second cluster in node TjS2Ti(U). Then, we get to node TiS2TjS2Ti(U) (=u1u2u3...uj...un+1un+2un+3...ui...u2n+1) from node S2TjS2Ti(U) via edge Ti, which interchanges the symbols uj and u1 in node S2TjS2Ti(U). Because the permutation of node TiS2TjS2Ti(U), which is obtained by sequentially applying the edge sequence <Ti, S2, Tj, S2, Ti> to node U, is identical to the permutation of node U', we can see that two nodes T and T' adjacent via edge T(i, j) in the transposition graph T2n+1 can be embedded into a macro-star graph MS(2, n) with dilation 5. Case 3. i, j ≥ n+2 In edge T(i, j), which connects nodes T(t1t2t3...ti...tj...t2n+1) and T' of the transposition graph T2n+1, because the values i and j are larger than (n+1), the two symbols can be interchanged with each other only from the (n+2)th symbol to the (2n+1)th symbols. The interchanges from the (n+2)th symbol to the (2n+1)th symbols in node U(u1u2u3... ui... uj... u2n+1) of MS(2,n), in which node T(t1t2t3... ti... tj... t2n+1) of T2n+1 is mapped, represent the interchanges that occur among the symbols that consist of the second cluster of U. 
Because the permutation of node U' is u1u2u3...un+1un+2...uj...ui...u2n+1, and nodes U(u1u2u3...un+1un+2...ui...uj...u2n+1) and U' are not adjacent to each other, the edge sequence required for the shortest path routing from node U to node U' is <S2, Ti, Tj, Ti, S2>. The routing process from node U and node U' is as follows.
142
H. Lee et al.
We first use edge generator S2 to exchange the first and second clusters in which symbols ui and uj exist, because the interchange between the symbols in node U occurs only between the symbols positioned in the first column and the symbols positioned in the first cluster. Thus, node S2(U)(=u1un+2...ui...uj...u2n+1u2u3...un+1) is adjacent to node U(u1u2u3...un+1un+2...ui...uj...u2n+1) via edge S2, and node TiS2(U)(=uiun+2...u1...uj...u2n+1u2u3...un+1) is adjacent to node S2(U) via edge Ti, which places symbol ui of S2(U) at the first position. Following this, node TjTiS2(U)(=ujun+2...u1...ui...u2n+1u2u3...un+1) is reached from node TiS2(U) via edge Tj, which interchanges the first symbol ui with symbol uj in the first cluster in S2Ti(U). Next, node TiTjTiS2(U)(=u1un+2...uj...ui...u2n+1u2u3...un+1) is connected to TjTiS2(U) by edge Ti, which interchanges the first symbol uj with u1 in node TjTiS2(U). Then, we use edge S2 to swap the first and second clusters in node TiTjTiS2(U), and reach node S2TiTjTiS2(U)(=u1u2u3...un+1un+2...uj...ui...u2n+1) from node TiTjTiS2(U) through edge S2. Here, we can seen that the permutation of node S2TiTjTiS2(U), which is obtained by sequentially applying the edge sequence <S2, Ti, Tj, Ti, S2> to node U, is identical to the permutation of node U'. Therefore, the edge T(i, j) that connects nodes T(t1t2t3...ti...tj...t2n+1) and T' in the transposition graph T2n+1 can be embedded into a macro-star graph MS(2, n) with dilation 5. Consequently, all nodes in a transposition graph T2n+1 can be mapped one-to-one onto a macro-star graph MS(2, n), and an edge in T2n+1 can be embedded into MS(2,n) with dilation of 5 or under.
4 Conclusion The star graph, a well-known topology of MIMD multi-computer systems with distributed memory, is a small diameter, node-symmetric, hierarchical, and maximum fault-tolerant interconnection network. The transposition graph not only improves the fault tolerance of multi-computer systems, but also shares the advantages of a star graph. It also contains a star graph as its sub-graph. The bubble-sort and macro-star graphs also share the advantages of the star graph. In this paper, we proposed methods for embedding bubble-sort, transposition, and macro-star graphs into one another, which have been introduced as variations of the star graph. These graphs have the same number of nodes and also the same number of symbols in a node. The proposed embedding methods are based on one-to-one mapping of two arbitrary nodes U and U' of a source graph G onto two nodes in a target graph G'. We assumed that two mapped nodes in G' are connected with a minimum of edges based on the edge definition of target graph G'. Then, we analyzed dilation by the number of edges used for the shortest path routing between two mapped nodes in G'. Embedding analysis using the edge definition of graphs is possible because the bubble-sort, transposition, and macro-star graphs are all node-symmetric. The results of this study indicate that bubble-sort graph Bn can be embedded into transposition Tn with dilation 1 and expansion 1. In addition, macro-star graph MS(2, n) can be embedded into transposition graph T2n+1 with dilation n, but with an average dilation of 2 or less. Acknowledgement. This research was supported by Basic Science research program through the National research Foundation of KOREA(NRF) funded by the Ministry of Education, Science and Technology(2009-0086676).
Embedding Algorithms for Bubble-Sort, Macro-star, and Transposition Graphs
143
References 1. Feng Athas, W.C., Seitz, C.L.: Multicomputers: Message-Passing Concurrent Computers. IEEE Computer, 9–24 (August 1988) 2. Saad, Y., Schultz, M.H.: Topological Properties of Hypercubes. IEEE Trans. Comput. 37(7), 867–872 (1988) 3. Azevedo, M.M., Bagherzaeh, N., Latifi, S.: Low Expansion Packing and Embeddings of Hypercubes into Star Graphs: A Performance-Oriented Approach. IEEE Parallel and Distributed Systems 9(3), 261–274 (1998) 4. Ghafoor, A., Bashkow, T.R.: A Study of Odd Graphs as Fault-Tolerant Interconnection Networks. IEEE Trans. Computers 40(2), 225–232 (1991) 5. Kim, J.S., Seo, J.H., Sim, H., Lee, H.O.: Embedding Algorithm Between the Macro-star Graph and the Matrix-star Graph. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Application. PDPTA, pp. 418–426 (2006) 6. Kim, M.H., Kim, D.W., Lee, H.O.: Embedding Algorithms for Star, Bubble-Sort, RotatorFaber-Moore, and Pancake Graphs. In: Hsu, C.-H., Yang, L.T., Park, J.H., Yeo, S.-S. (eds.) ICA3PP 2010. LNCS, vol. 6082, pp. 348–357. Springer, Heidelberg (2010) 7. Seo, J.H., Lee, H.O., Jang, M.S.: Node Mapping Algorithm Between Macro-star Networks and Transposition Networks. In: Proceedings of the Korean Information Science Society Conference 2008, vol. 35(1B), pp. 584–587 (2008) 8. Akers, S.B., Krishnamurthy, B.: A Group-Theoretic Model for Symmertric Interconnection Network. IEEE Trans. Comput. 38(4), 555–565 (1989) 9. Yang, X., Tang, Y.Y., Cao, J.: Embedding Torus in Hexagonal Honeycomb Torus. IET Computers & Digital Techniques 2, 86–93 (2008) 10. Park, J.H., Lim, H.S., Kim, H.C.: Embedding Starlike Trees into Hypercube-Like Interconnection Networks. In: Min, G., Di Martino, B., Yang, L.T., Guo, M., Rünger, G. (eds.) ISPA Workshops 2006. LNCS, vol. 4331, pp. 301–310. Springer, Heidelberg (2006) 11. Berthome, P., Ferreira, A., Perennes, S.: Optimal Information Dissemination in Star and Pancake Networks. IEEE Trans. on Parallel and Distributed Syst. 7(12), 1292–1300 (1996) 12. Chou, Z.T., Hsu, C.C., Sheu, J.P.: Bubblesort Star graphs: A New Interconnection Network. In: 9th International Parallel Processing Symposium, pp. 41–48 (1996) 13. Latifi, S., Srimani, P.K.: Transposition Networks as a Class of Fault-Tolerant Robust Networks. IEEE. Trans. Comput. 45(2), 230–238 (1996) 14. Yeh, C.H., Varvarigos, E.A.: Macro-Star Networks: Efficient Low-Degree Alternatives to Star Graphs. IEEE Trans. Parallel and Distributed Systems 9(10), 987–1003 (1998) 15. Corbett, P.F.: Rotator Graphs: An Efficient Topology for Point-to-Point Multiprocessor Networks. IEEE Trans. Parallel Distributed Syst. 3(5), 622–626 (1992) 16. Faber, V., Moore, J. W.: High-degree, Low-diameter Interconnection Networks with Vertex Symmetry: The Directed Case. LA-UR-88-1051, LosAlmos National Lab., Los Almos (1988)
An Efficient Simulation Algorithm for Cache of Random Replacement Policy Shuchang Zhou Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, 100190, Beijing, China [email protected]
Abstract. Cache is employed to exploit the phenomena of locality in many modern computer systems. One way of evaluating the impact of cache is to run a simulator on traces collected from realistic work load. However, for an important category of cache, namely those of random replacement policy, each round of the na¨ıve simulation can only give one out of many possible results, therefore requiring many rounds of simulation to capture the cache behavior, like determining the hit probability of a particular cache reference. In this paper, we present an algorithm that efficiently approximates the hit probability in linear time with moderate space in a single round. Our algorithm is applicable to realistic processor cache parameters where the associativity is typically low, and extends to cache of large associativity. Experiments show that in one round, our algorithm collects information that would previously require up to dozens of rounds of simulation. Keywords: Simulation, Cache memories, Stochastic approximation.
1
Introduction
Modern computer systems depend heavily on the efficacy of cache in exploiting locality for improving performance. There are hardware cache built into processors and disk drivers, as well as software cache used in operating system kernels, proxy and file servers. One way of evaluating cache’s impact is to run a cache simulator on traces collected from realistic work load. The simulator will mimic the configuration, topology and replacement policy of the simulated cache. However, for an important category of cache, namely those of random replacement policy, as the replacement policy randomly determines one among multiple candidates for eviction, the na¨ıve simulation only constitutes a Monte Carlo simulation, therefore can only give one out of many possible results. In particular, in a single round, na¨ıve simulation cannot give the hit probability of each cache reference, which is of particular interest in program analysis[3][4]. If we maintain n copies of possible cache states and simulate access sequences on these states simultaneously, then the time and space requirement of the simulation will be equal to running n copies of na¨ıve simulation in parallel, with no C. Ding, Z. Shao, and R. Zheng (Eds.): NPC 2010, LNCS 6289, pp. 144–154, 2010. c IFIP International Federation for Information Processing 2010
An Efficient Simulation Algorithm
145
gain in efficiency. In contrast, we present an algorithm that directly approximates the hit probability. Experiments show that in one round, our algorithm collects information that would previously require up to dozens of rounds of simulation. The rest of paper is organized as follows. Section 2 gives some background for our algorithm. Section 3 describes our algorithm. Section 4 evaluates our algorithm by simulating realistic work load. Section 5 overviews related work.
2
Background
Cache maps addresses to values. In a fully-associative1 cache under random replacement policy, when the cache receives a request for an address A, the cache will look up all its content to see if the cache line containing A is already in the cache. If the matching succeeds, the request causes a hit event, and the corresponding value in cache is returned to the request. Otherwise, the request causes a miss event and the cache will fall back to some backup mechanism to satisfy the request. In case of miss event, a random place from the cache is picked, regardless of whether it originally contains valid data or not.2 The original content at the place will be evicted to allow storing the address A and its corresponding value just retrieved through backup mechanism. Two nearest references to the same cache line forms a reuse window, with references to other cache lines in between. Cache of random replacement policy3 are found in Translation Look-aside Buffers[6], and processor cache in ARM [13] processor family. The Monte Carlo simulation of random replacement policy faithfully models the hardware by using (pseudo) random numbers to select the target for eviction. However, the selection is path dependent. For example, assume a fully-associative cache of size two and let < a, e > denote that cache contains a and e as content, with different letters representing different cache lines, and assume an access sequence of d, a. As d is not in cache, it causes a miss event, leaving the cache in either < a, d > or < d, e >. Now the next access a will exhibit different behavior under the two possible cache states: for < a, d > it will be a hit event, and for < d, e > it will be a miss event. An observer that simulates sequence d, a multiple times will either see miss, miss or miss, hit, with equal likelihood. However, such a phenomenon can not be observed in a single round of simulation. To make the matter worse, the number of possible cache states keeps growing after each miss event, and in consequence requires more rounds of simulation to 1
2
3
Hereafter we will limit our attention to fully-associative cache. In a set-associative cache, as the mapping from the address to the cache set is fixed, the selection of eviction targets in different cache sets are statistically independent. Hence a setS associative cache of size S and associativity M is equivalent to M fully-associative cache operating in parallel, each of size M . Even if the replacement algorithm takes care to not evict valid data when there are free slots, as the cache is soon filled up with valid data, there will be no difference in practice. Sometimes referred to as pseudo random replacement policy due to difficulty, if not impossibility, of obtaining true randomness.
146
S. Zhou
capture the cache behavior. As mentioned above, maintaining n copies of possible cache states and simulating access sequences on these states simultaneously will not improve efficiency compared to running n copies of simulation in parallel, or running n rounds of simulation in sequence. In theory, we can represent the state of a cache under random replacement policy as a probability distribution among all possible states, and each possible address as a transition probability matrix. The simulation of cache is then reduced to multiplying the original probability distribution Z with a chain of matrices. However, the probability distribution has M size, and each transition i=0 i Z ) non-zero elements, where Z is the number of distinct matrix has O(M M i=0 i elements in trace, rendering this simple approach infeasible in general.4 Instead, we present an algorithm that implicitly calculates the probability distribution.
3
Algorithm
3.1
The Foundation
Assume the trace of cache lines of an access sequence with indices in logical time to be: a0 , a1 , .., aN . We represent the miss event of time i by an indicator random variable[1] Xi , such that Xi = 1 when a miss event happens at i and Xi = 0 otherwise, forming another sequence: X0 , X1 , .., XN . The indicator for hit event of time i is just 1 − Xi . The hit probability of ai is 1 − E(Xi ), where E(x) is x’s expectation. The expected count of hit event of a sequence S is just i∈S (1 − E(Xi )). Let the cache size be M . We can determine Xi in the following way. At time i, we inspect the subsequence {aj |j < i, aj = ai }. If the subsequence is empty, then ai will not be present in cache, hence Xi is 0. Otherwise, there is a reuse window and we assume the largest index of the subsequence, i.e. the start of the reuse window, is k. We let the number of misses since k be Zi . i−1 l=k+1 Xl , if there is a reuse window Zi = (1) ∞, otherwise. Due to linearity of expectation, we get i−1 l=k+1 E(Xl ), if there is a reuse window E(Zi ) = ∞, otherwise.
(2)
Proposition 1 E(Xi ) = 1 − E((1 − 4
n k
1 Zi ) ). M
is the number of k-element subsets of an n-element set.
(3)
An Efficient Simulation Algorithm
147
Proof. We observe that ak is definitely in cache after time k. As every miss 1 between k and now will evict ak with probability M , the probability of ak still being present in cache, which is also the probability of observing a hit event, is 1 ) to the power of number of misses since (excluding) k. Then with a given (1 − M Zi , we have: 1 Zi E(Xi |Zi ) = 1 − (1 − ) . (4) M Taking the expectation over Zi gives equation 3. However, as Xi are related to the preceding elements in {Xi }, hence in general E(aXi +Xj ) = E(aXi )E(aXj ). Therefore we cannot calculate right hand side of equation 3 easily. However, we note an approximation: Proposition 2 When M is large, E(Xi ) ≈ 1 − (1 −
1 E(Zi ) ) . M
(5)
Proof. Let E(Xi ) = 1 − E((1 − When Zi = ∞, δ = 0 as both (1 − When Zi < ∞ δ =
1 E(Zi ) M)
1 E(Zi ) M) 1 E(Zi ) ln(1− M )
= 0 and E((1 −
1 Zi M) )
(6) = 0.
1 Zi M) ) 1 Zi ln(1− M )
= (1 −
− E((1 −
=e
− E(e
(7)
)
1 1 ))n ))n (E(Zi ) ln(1− M (Zi ln(1− M − E( ∞ ) n=0 n=0 n! n! n 1 ∞ )|) (| ln(1− M (E(Zin ) − E(Zi )n ). − n=0 (−1)n n!
∞
=
1 Zi 1 E(Zi ) ) ) = 1 − (1 − ) + δ. M M
(8) (9) (10)
∞ We note δ is the sum of an infinite converging alternating series n=0 (−1)n an 0 0 where it can be observed that a0 = a1 = 0 as E(Zi ) = E(Zi ) and E(Zi1 ) = (| ln(1−
1
M E(Zi )1 . As δ is an alternating series with coefficients n! rapidly , we may expect δ ≈ 0 and then obtain approximation 5.
)|)n
decreasing
Now we can can use equation 2 and 5 to approximate 1 − E(Xi ), the hit probability of each cache reference. However, we are unable to give a rigorous bound of δ and have to resort to empirical evaluation in section 4. 3.2
The Plain Algorithm
The key observation is that we can use a map m and an offset b to efficiently calculate E(Zi ) from preceding E(Xi ). When we are at i, we want m[x] + b to equal to the number of misses in the reuse window, i.e. since last occurrence of x to time i. By definition, E(Zi ) = m[ai ] + b. If x is not seen before time i, we let m map x to ∞.
148
S. Zhou
We see m and b can be constructed and maintained in the following way. – Initialization: ∀x, m[x] ← ∞, b ← 0. – Maintenance: At time i, b ← b + E(Xi ), m[ai ] ← −b. We can use a hash table h to implement m, and let the absence of x from h to indicate m[x] = ∞. As maintenance of h can be done in O(1), we have obtained an algorithm that approximates {E(Xi )} with time complexity O(N ). 3.3
The -Approximation
The main problem with the algorithm presented above is that the hash table size will grow up to the number of distinct elements in trace. We observe from equation 5 that within an absolute error of the hit probability 1 − E(Xi ), we can stop the backward summation of E(Xi ) at l when (1 − That is when
i≥l
1 i≥l E(Xi ) ) ≤ . M
E(Xi ) ≥
log 1 . log(1 − M )
(11)
(12)
log Let K be log(1− 1 . Intuitively, we are ignoring reuses that are under probM) ability, which means we can prune h of an element x if h[x] ≥ K − b under -approximation of 1 − E(Xi ). In light of this observation, we can improve the space complexity using slidingwindow technique. We split h into two hash tables h and h , and use them in round-robin fashion. Both tables are used for query. Putting an address to one table will remove its entry from another table, if there should be one. We use a counter c to keep track of when to swap h and h , and record the number of swaps with d.
– Initialization: c ← 0, d ← 0, h and h points to two empty hash tables. – Maintenance: At time i, c ← c + E(Xi ), b ← b + E(Xi ), h [ai ] ← −b. If c > K then c ← c − K, clear h , swap h and h , and d ← d + 1; otherwise continue the loop. Proposition 3. The -approximation is correct and works within linear time log 1 and O( log(1− 1 ) ≈ O(M ln( )) space. ) M
Proof. Correctness: We observe that c = b − dK. When x is updated, it holds that h [x] + b = 0. As b is monotonically increasing, it holds that 0 ≤ h [x] + b = h [x] + c + dK. As c ≤ K, we have h [x] ≥ −c − dK ≥ −(d + 1)K. h contains
An Efficient Simulation Algorithm
149
elements of step d− 1, therefore h [x] ≥ −dK. At the time of clearing h , c > K, all values x in h satisfy h [x] ≥ −dK = K − (dK + K) > K − (dK + c) = K − b
(13)
and can be safely pruned. Complexity: Let s be the sum of size of h and h , we next show that s ≤ 2K. At time i, if ai is found in h or h , then s will not increase, but b may be incremented; if ai is not found in either table, both s and b are incremented by 1. Thus increment of s is bounded by the increment of b when c goes from 0 to K. As h and h are cleared every two steps, s ≤ 2K. In this way we achieve log space complexity of O(K) = O( log(1− 1 ). As the O(K) cost of clearing tables M) is amortized among at least K cache references, the time complexity remains O(N ).
4
Empirical Evaluation
We implement two algorithms: the plain and the -approximation variation of our algorithm presented above, and a na¨ıve simulation algorithm that directly simulates current cache state. We perform our experiments on traces collected by HMTT, a platform independent full-system memory trace monitoring system[5]. The traces are collected from running LINPACK[10] and CPU2000[11] benchmark. Both our algorithm and the average of many rounds of na¨ıve simulation give an approximation of hit probability of each reference. In figures 1, 2, 3, 4, 5, we compare the inferred probability from average of 5, 50, and 500 rounds of the na¨ıve simulation, and the plain and -approximation variation of our algorithm. We use trace fragments of length 106 and use the average of 500 rounds of the na¨ıve simulation as reference, which is depicted in the figures as squaredotted lines with absolute value of hit probability as x-value and the distribution as y-value. All the other lines depict the distribution of absolute error of hit probability from the reference. A curve decreasing faster towards right indicates better approximation precision. It can be observed that both the plain and approximation variation of our algorithm provide approximation of precision comparable to what is obtained from average of 50 rounds of na¨ıve simulation, and consistently outperforms average of 5 rounds. Our algorithm is applicable to realistic processor cache parameters, for example when M = 2, and extends to cases of large associativity. As indicated by the proof of 5, larger M leads to better approximation precision, such that our algorithm outperforms average of 50 rounds when M ≥ 8. We also observe that choosing to be 0.01 or even 0.1 only mildly affects the precision of approximation. Experiments with other M values and benchmarks show similar results and are not shown here for lack of space. M = 1 case is not considered as in this case eviction targets are picked deterministically.
150
S. Zhou
"" $
! "#
Fig. 1. Absolute error of various methods when M=2 for LINPACK
"" $
! "#
Fig. 2. Absolute error of various methods when M=8 for LINPACK
An Efficient Simulation Algorithm
151
"" $
! "#
Fig. 3. Absolute error of various methods when M=64 for LINPACK
"" $
! "#
Fig. 4. Absolute error of various methods when M=4 for SWIM
152
S. Zhou
"" $
! "#
Fig. 5. Absolute error of various methods when M=16 for WUPWISE
5
Related Work
Cache simulation has long been an important tool in studying the impact of cache on program execution[7]. Much of literature is devoted to study of cache under Least Recently Used policy[8]. However, techniques that are developed for studying LRU cache, like the reuse distance technique, do not apply to cache of random replacement policy in general. In [12], random replacement cache is investigated under a simplifying assumption that miss rate should stay stable over a short period of execution. Under this assumption, [12] uses approximation 5 to calculate the gross cache hit probability. In contrast, we exploit approximation 5 without the assumption, and are able to give hit probability of each cache reference. [15] and [16] also make other simplifying assumptions to approximate the gross hit probability. Most of previous studies on cache of random replacement policy use average of a few rounds of na¨ıve simulation as an approximation of the hit ratio of each cache reference. For example, in [9], which studies the impact of replacement policy on instruction cache, average of three rounds of simulation is used.
6
Conclusion
For an important category of cache, namely those of random replacement policy, it would be necessary to perform many rounds of na¨ıve Monte Carlo simulation to approximate the hit probability of each cache reference. To improve the efficiency, we devise a simulation algorithm that can approximate the probability in a single round. We start with a plain algorithm and then reduce its
An Efficient Simulation Algorithm
153
space complexity through -approximation. The -approximation variation of the algorithm works in linear time complexity, and has space complexity of log 1 O( log(1− 1 ) ≈ O(M ln( )), where M is the size of cache. Experiments show M) that in one round, our algorithm collects information that would previously require up to dozens of rounds of simulation.
Acknowledgments We thank Ma Chunhui for interesting discussions that lead to this paper. We thank Bao Yungang, the author of HMTT[5], for providing the trace files and insightful comments. We would also like to thank the anonymous reviewers for their precious remarks which help improve this paper.
References 1. Introduction to algorithms. MIT Press, Cambridge (2001) ¨ 2. Fang, C., Carr, S., Onder, S., Wang, Z.: Reuse-distance-based miss-rate prediction on a per instruction basis. In: Proceedings of the 2004 Workshop on Memory System Performance MSP 2004, Washington, D.C., June 8, pp. 60–68. ACM, New York (2004) 3. Ding, C., Zhong, Y.: Predicting whole-program locality through reuse distance analysis. In: Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, PLDI 2003, San Diego, California, USA, June 9-11, pp. 245–257. ACM, New York (2003) 4. Beyls, K., D’Hollander, E.H.: Reuse Distance-Based Cache Hint Selection. In: Monien, B., Feldmann, R.L. (eds.) Euro-Par 2002. LNCS, vol. 2400, pp. 265–274. Springer, Heidelberg (2002) 5. Bao, Y., Chen, M., Ruan, Y., Liu, L., Fan, J., Yuan, Q., Song, B., Xu, J.: HMTT: a platform independent full-system memory trace monitoring system. In: Proceedings of the 2008 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2008, Annapolis, MD, USA, June 2-6, pp. 229–240. ACM, New York (2008) 6. Sweetman, D.: See MIPS Run, 2nd edn. Morgan Kaufmann Publishers, San Francisco (2006) ISBN 0-12088-421-6 7. Sugumar, R.A., Abraham, S.G.: Multi-configuration simulation algorithms for the evaluation of computer architecture designs. Technical report, University of Michigan (1993) 8. Mattson, R.L., Gecsei, J., Slutz, D., Traiger, I.L.: Evaluation techniques for storage hierarchies. IBM System Journal 9(2), 78–117 (1970) 9. Smith, J.E., Goodman, J.R.: A study of instruction cache organizations and replacement policies. SIGARCH Comput. Archit. News 11(3), pp. 132–137 (1983) 10. http://www.netlib.org/linpack/ 11. http://www.spec.org 12. Berg, E., Hagersten, E.: Fast data-locality profiling of native execution. SIGMETRICS Perform. Eval. Rev. 33(1), 169–180 (2005) 13. ARM Cortex-R4 processor manual, http://www.arm.com
154
S. Zhou
14. Guo, F., Solihin, Y.: An analytical model for cache replacement policy performance. In: Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2006/Performance 2006, Saint Malo, France, June 26-30, pp. 228–239. ACM, New York (2006) 15. Chandra, D., Guo, F., Kim, S., Solihin, Y.: Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture. In: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, HPCA, February 12-16, pp. 340–351. IEEE Computer Society, Washington (2005) 16. Suh, G.E., Devadas, S., Rudolph, L.: Analytical cache models with applications to cache partitioning. In: Proceedings of the 15th International Conference on Supercomputing, ICS 2001, Sorrento, Italy, pp. 1–12. ACM, New York (2001) 17. Agarwal, A., Hennessy, J., Horowitz, M.: An analytical cache model. ACM Trans. Comput. 7(2), 184–215 (1989)
DABGPM: A Double Auction Bayesian Game-Based Pricing Model in Cloud Market Shifeng Shang, Jinlei Jiang, Yongwei Wu, Zhenchun Huang, Guangwen Yang, and Weimin Zheng Department of Computer Science and Technology Tsinghua National Laboratory for Information Science and Technology Tsinghua University, Beijing, 100084, China [email protected], {jjlei,wuyw,huangzc,ygw,zwm-dcs}@tsinghua.edu.cn
Abstract. Recently IT giants such as Google, Amazon, Microsoft, and IBM are gearing up to be a part of the Cloud and begin to sell their cloud services. However, the current market trading mechanism is inflexible, and the price is not reasonable enough in some situation. Therefore, we first propose a cloud market framework for people to build a uniform and fully competitive cloud market where users can buy resources from different companies and exchange their idle resources in a more flexible way. Then we define a double auction Bayesian Game-based pricing model (DABGPM) for the suggested cloud market and discuss how to develop an optimal pricing strategy for this model. Our work, we think, makes a good example of more flexible and more reasonable cloud resources trading. Keywords: Cloud Computing; Pricing Model; Double Auction; Cloud Market.
Table 1 shows the services prices of different vendors, where the basic configuration of compute is of 1 GB (=109 bytes) RAM, and 40 GB Disk. We can see that the prices of compute range from $0.06 to 0.12 with a maximum difference as much as $0.06 per hour for the same resource. Obviously, it would be attractive to establish a uniform and fully competitive cloud computing resources trading market where users can run their applications using resources from different companies as well as exchange unused resources. Table 1. The cloud services prices of different vendors Price Type
Amazon
Windows Azure
Google
GoGrid
Rackspace
Compute CPU /hours Storage GB/month
$0.085/linux $0.12/windows
$0.12
$0.10
$0.10
$0.06
$0.15
$0.15
$0.15 first 0.5GB free
$0.15 first 10GB free
$0.15
Data Upload GB
$0.10 Free through June 30, 2010
$0.10
$0.12
$free
$0.08
Data Download GB
$0.17 $0.13 if>10TB
$0.15
$0.10
$0.29
$0.22
The main contributions of this paper are two-fold. Firstly, we propose a cloud market framework for people to build a uniform and fully competitive cloud market where users can trade cloud resources in a more flexible and more effective way. Secondly, we propose the DABGPM model for our cloud market and show how to develop an optimal pricing strategy for this model. The remainder of this paper is organized as follows. Section 2 is the related work. Section 3 gives an introduction to our global cloud market framework. In Section 4, we define the DABGPM model and discuss how to develop pricing strategies based on it. The last section is the conclusion and future work.
2 Related Work There exist many economy-based resource allocation models in grid research [12], mainly including autonomous pricing mechanism [13], pricing based on the combinatorial auction [14], and pricing based on the commodities market model and load prediction [15]. They applied the auction model, the bargaining model, the combinatorial auction model, and the commodity market model, respectively. But these models usually focus on users without taking providers’ price requirements into consideration. Recently, many research projects such as Tycoon [16] and Shirako [17] have been proposed for trading resource allocations. These projects focus on allocating resources based on economy factor rather than a real market. In other words, they do not take into account the full competition between providers.
DABGPM: A Double Auction Bayesian Game-Based Pricing Model
157
With the rapid advancement of QoS (quality of services) and the establishment of security and interoperability standards for cloud computing, more and more companies began offering cloud resources (e.g., computing, storage, platform and software) and the cloud market is becoming more and more competitive. It is the time to establish a uniform and fully competitive cloud market for trading cloud resources.
3 Cloud Market Framework To facilitate cloud resources trading, we propose a uniform and fully competitive cloud market framework as shown in Fig. 1. There are three main types of participants in this framework, namely the Cloud Market Place (CMP), User Agent (UA), and Cloud Resource Provider (CRP).
Fig. 1. The Cloud Market Framework
3.1 Cloud Resource Provider Cloud Resource Provider (CRP) is a datacenter that sells its resources to users and profits from it. The resources include computing power, storage space, network bandwidth, etc. CRP has five components, that is, the Auction Agent, the Admission Control, the Resource Scheduler, the Pricing and Accounting, and the SLA Monitor. • The Auction Agent is responsible for registering resource offers into the Cloud Resource Directory of CMP. It is also auction agent’s duty to generate a bid and submit it to the CMP. • The Admission Control receives the auction result from the auction agent, and decides whether to do the requested tasks or not.
158
S. Shang et al.
• The Resource Scheduler is responsible for allocating resources according to the corresponding request. It is also in charge of balancing the supply and demand of cloud resources, providing feedback in terms of economic incentives for both cloud consumers and providers. • The Pricing and Accounting is responsible for charging users for the resources consumption; • The responsibility of SLA Monitor is to accept the user’s request and guarantee the user-requiring service quality. Cloud providers must consider and meet different QoS requirements of each individual consumer as negotiated in specific SLAs (Service Level Agreements). 3.2 Cloud Market Place CMP is composed of the Cloud Resource Auctioneer, the Pricing Algorithm, the Cloud Resource Directory, the Bank and the Insurance Service. • The Cloud Resource Auctioneer is responsible for collecting the bids of resources made by the UAs and CRPs. Based on the corresponding bidding information, the auctioneer runs the double auction algorithm to determine the winning UAs and CRPs. It is also in charge of informing the UAs and CRPs who have participated in the auction of the auction results. • The Pricing Algorithm is responsible for generating specific allocation results and the corresponding pricing information. The price calculated in this component will be sent to all the UAs and CRPs who participate in the trade. • The Cloud Resource Directory (CRD) allows participants to locate providers or consumers with the right offers. The cloud market place periodically clears bids received from market participants. • The Bank ensures that financial transactions pertaining to agreements between participants are carried out. • The Insurance Service provides insurances against a number of events that would harm both providers and users: (1) the cloud resources cannot be delivered to users; (2) the resource availability is interrupted; (3) users do not pay the resources consumed. 3.3 User Agent User agents help users make the appropriate choice of resources and auction types. There are four components within a user agent, that is, the Resource Planning Service, the Resource Discovery Agent, the Auction Agent, and the SLA Monitor. • The Resource Planning Service (RPS) helps users to determine the most appropriate computing capacity. The service will analyze the existing infrastructure, the deployed applications, and the resource requirements before suggesting a course of action according to users’ deadline, budget, and so on.
DABGPM: A Double Auction Bayesian Game-Based Pricing Model
159
• The SLA Monitor uses the result of RPS to form a service quality request and then sends the request to the auction agent. It is also in charge of monitoring the agreed SLAs to guarantee service quality. • The Resource Discovery Agent is responsible for finding the resources according to users’ SLA requirements. It contacts the CRD module of CMP to obtain the list of resources that meet these requirements. • The Auction Agent is responsible for generating the list of resources returned by the CRD. For each resource available, it makes a bid within the user’s SLA requirement and submits it to the CMP.
4 Pricing Strategies Based on the DABGPM Model As more and more IT Giants pave the way for creating a market infrastructure to trade services, a completely open bilateral cloud market will be established in the near future. Since cloud resource providers aim to maximize profit and consumers aim to minimize spending, how to do pricing is important to both sides. Here we proposed the DABGPM pricing model based on [18, 19, 20] and analyzed pricing strategies with it. 4.1 The DABGPM Model In a fully open and complete competitive global cloud resources exchange market, providers and buyers may not know each others’ valuations of cloud resources; it is a strategic game with incomplete information. The economic profit of the two sides (providers and buyers) is related to the strategies of their own and their opponents. Suppose that the total requirement of all users is Y. The providers and the consumers form a double auction, that is, they can decide whether to exchange Y resources. Let Vu and Vp be the expected resource price of the user and the provider respectively. Obviously, the provider knows the value of Vp, but doesn’t know the value of Vu, whereas the consumer knows Vu, but doesn’t know Vp. According to the above statements, the DABGPM pricing model involves the following elements: • a pair of resource provider and consumer, denoted by N=(s, b). • a set of states, denoted by Ω={(Vp, Vu)}. Indeed, Ω records the possible quoting prices of provider s and consumer b. • a set of signals, denoted by T={t1, t2, …}. Each signal indicates a resource quoting action. • a signal function τi(tÆΩ), which means signals will trigger state change. • a probability function that describes the possibility of each state in Ω. Assume that Vp and Vu are uniformly distributed on [0,1], and the provider and the user bids and offers at the same time. Obviously, we have Ps [0,1] and Pb [0,1]. If Ps≤Pb, two sides will make a deal at price P=kPs+(1-k)Pb; Otherwise, the deal will not happen. The payoff function of resource provider is as follows:
∈
∈
160
S. Shang et al.
if p ≤ p (1) if p >p
⎧ kp + (1 − k ) p b − V p Us = ⎨ s 0 ⎩
s
b
s
b
The payoff function of user b is as follows:
if p ≤ p (2) if p >p
⎧ V − (kps + (1 − k ) p b ) Ub = ⎨ u 0 ⎩
s
b
s
b
4.2 The Solution of DABGPM Model According to the double auction process and rules, Vp and Vu is independent and uniformly distributed on [0,1], the selling price of provider s is Ps(Vp); the bidding price of buyer b is Pb(Vu), the strategy p*s ( Vp ) , p*b ( Vu ) is a optimal strategy if and only if the following two conditions are met.
(
)
(1) p*s ( Vp ) is one solution to the optimization question:
max [k(p + E[(p (V ) | (p ( V ) ≥ p )]) − c]P{p (V ) ≥ p } (3) s
Ps
b
u
b
u
s
b
u
s
E[(pb(Vu)|pb(Vu)]≥ps)] is the expected price when the providers’ selling price is lower than consumers’ bidding price. (2) p*b ( Vu ) is one solution to the optimization question:
max [V -kp -kE[p (V ) | p Pb
u
b
s
p
b
≥ p s ( Vp )]]P{ pb ≥ ps ( Vp )}
(4)
E[( ps ( Vp ) |p b ≥ ps ( Vp ) ) is the expected price when the providers’ selling price is lower than consumers’ bidding price, p*s ( Vp ) and p*b ( Vu ) are the optimal strategy of provider s and consumer b respectively. Assuming the provider and consumer are all adopting a linear bidding pricing strategy, the solution of DABGPM is below: ⎧⎪ p s ( V p ) = α s + β s V p ( 5) ⎨ ⎪⎩ p b ( Vu ) = α b + β b Vu Vu is defined as of the uniform distribution, hence, p b is also uniformly distributed on the interval [α b , α b + β b ] , therefore
P {p b ( Vu ) ≥ p s } = P {α b + β b Vu ≥ p s } =
α b + β b − ps βb
1 αb + βb xdx ∫ β b ps 1 E ⎣⎡ p b ( Vu ) | pb ( Vu ) ≥ ps ⎦⎤ = = ( ps + α b + β b ) P {p b (Vu ) ≥ ps } 2
(6)
(7)
DABGPM: A Double Auction Bayesian Game-Based Pricing Model
161
max [kp + k2-1 ( p + α + β ) - V ] ⋅ α +ββ - p (8) b
Ps
s
s
b
b
b
s
p
b
Applying its first-order condition, we get: ps =
k 1 Vp ( αs + βs ) + k +1 k +1
(9)
Thus it can be seen, if consumer b adopting a linear strategy, the optimal response of cloud resource provider is also linear. As the same, because Vp is of uniform distribution, hence, p s is also of uniform distribution on the interval [α s , α s + β s ] , therefore: p − αb P {p b ≥ ps (Vp )} = P {p b ≥ α s + β s Vp } = b (10) βs
E ⎡⎣ p s ( V p ) | p b ( Vu ) ≥ p b ≥ p s ( V p ) ⎤⎦ =
1 pb ∫ xdx β s αs
P {p b ≥ p s ( V p )}
=
1 ( pb + αs ) 2
(11)
Inserting equation (10) and (11) into the effective function of buyer, we can obtain:
max [V − ((1 − k ) p Pb
u
b
+
p −α k ( pb + α s ))] ⋅ b b 2 βs
(12)
Applying its first-order condition, we get:
p
b
=
k 1 as + Vu 1+ k 1+ k
(13)
Synthesizing equation (5), (9) and (13), we can obtain a pricing solution between cloud resource providers and buyers: k 1 ⎧ ⎪ ps (V p ) = 2k + 1 + k + 1 V p ⎪ k ∈ [0,1] ⎨ k2 1 ⎪ p (V ) = + V u ⎪⎩ b u (1 + k )(1 + 2k ) k + 1
(14)
4.3 Pricing Strategy Analysis
The Formula (14) has given an equilibrium pricing strategy between cloud providers and consumers; we will analyze the characteristics of it. For the sake of simplicity, we assume that k = 0.5 and thus the formula (14) changes into (15):
162
S. Shang et al.
1 2 ⎧ ⎪⎪ ps (Vp ) = 4 + 3 Vp ⎨ ⎪ p (V ) = 1 + 2 V ⎪⎩ b u 12 3 u
(15)
(1) Based on the linear equilibrium strategy, the highest bidding price of consumer is p b (1) = 0.75 , and the lowest price of provider is ps ( 0 ) = 0.25 ; (2) If V p > 0.75 , the selling price of provider ps (V p ) = 0.25 + 2 / 3Vp is lower than the real cost. Even if it is higher than the highest bidding price p b (1) = 0.75 , the trade will not happen.
Fig. 2. The Solution of DABGPM Model
(3) When Vu < 0.25 , although the bidding price is higher than the cost, it is lower than the lowest selling price, ps ( 0 ) = 0.25 , the trade will not happen either. This can be seen from Fig. 2. The two lines ps (V p ) = 0.25 + 2 / 3Vp and p b (Vu ) = 1 / 12 + (2 / 3)Vu indicate the optimal pricing strategy of provider and consumer is the bold line in Fig. 2. (4) From the above analysis, we can conclude that ps (V p ) = 0.25 + 2 / 3Vp and p b (Vu ) = 1 / 12 + (2 / 3)Vu is the optimal pricing strategy of provider and consumer respectively. For example, when the cost of provider is 0.3, the optimal price is 0.45, when the valuation of buyer v is 1.0, the optimal price is 0.75.
5 Conclusion With the development of cloud technology, there will be more users and companies renting hardware and/or software resources from the cloud instead of buying and maintaining by themselves. As more and more IT giants invest into the technology, cloud computing has evolved into a big piece of cake. It is a trend that a uniform and complete competitive cloud resources market will emerge in the near future. For such a market, the trading strategy and reasonable price is important. To facilitate such a
DABGPM: A Double Auction Bayesian Game-Based Pricing Model
163
market, we proposed a complete competitive cloud market framework and a double auction Bayesian game-based pricing model and discussed pricing strategies based on this model. We think the work reported here makes a good example of more flexible and more reasonable cloud resources trading and would promote the applications of cloud computing.
Acknowledgement This Work is supported by Natural Science Foundation of China (60803121, 60773145, 60911130371, 90812001, 60963005), National High-Tech R&D (863) Program of China (2009AA01A130, 2006AA01A101, 2006AA01A108, 2006AA01A111, 2006AA01A117).
References 1. http://www.nimbusproject.org/ 2. Buyya, R.: Market-Oriented Cloud Computing: Vision, Hype, and Reality of Delivering Computing as the 5th Utility. In: Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid. IEEE Computer Society, Los Alamitos (2009) 3. Chu, X., Nadiminti, K., Jin, C., Venugopal, S., Buyya, R.: Aneka: Next-Generation Enterprise Grid Platform for e-Science and e-Business Applications. In: Proceedings of the Third IEEE International Conference on e-Science and Grid Computing, pp. 151–159. IEEE Computer Society, Los Alamitos (2007) 4. OpenNebula Project, http://www.opennebula.org/ 5. Zheng, W.M.: An Introduction to Tsinghua Cloud. Science in China Series F: Information Science 53(5) (2010) 6. Amazon Elastic Compute Cloud (EC2), http://aws.amazon.com/ec2/ 7. http://www.microsoft.com/windowsazure/ 8. Google App Engine, http://appengine.google.com 9. http://www.rackspace.com/index.php 10. http://www.gogrid.com/index.v2.php 11. http://www.vps.net/ 12. Buyya, R., Abramson, D., Venugopal, S.: The Grid Economy. Proceedings of the IEEE 93(3), 698–714 (2005) 13. Jin, Y., Shoubao, Y., Maosheng, L., Qianfei, F.: An Autonomous Pricing Strategy toward Market Economy in Computational Grids. In: Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC 2005), vol. II, pp. 793–794. IEEE Computer Society, Los Alamitos (2005) 14. Schwind, M., Gujo, O., Stockheim, T.: Dynamic Resource Prices in a Combinatorial Grid System. In: Proceedings of the The 8th IEEE International Conference on E-Commerce Technology and The 3rd IEEE International Conference on Enterprise Computing, E-Commerce, and E-Services. IEEE Computer Society, Los Alamitos (2006) 15. Zhao, X., Xu, L., Wang, B.: A Dynamic Price Model with Demand Prediction and Task Classification in Grid. In: Proceedings of the Sixth International Conference on Grid and Cooperative Computing, pp. 775–782. IEEE Computer Society, Los Alamitos (2007)
164
S. Shang et al.
16. Lai, K., Rasmusson, L., Adar, E., Zhang, L., Huberman, B.A.: Tycoon: An implementation of a distributed, market-based resource allocation system. Multiagent Grid Syst. 1, 169– 182 (2005) 17. Irwin, D., Chase, J., Grit, L., Yumerefendi, A., Becker, D., Yocum, K.G.: Sharing networked resources with brokered leases. In: Proceedings of the Annual Conference on USENIX 2006 Annual Technical Conference. USENIX Association, Boston (2006) 18. Ren, Y.L., Zou, X.Y.: Bidding game model of a power generation company based on firstprice sealed auction. Journal of Systems Engineering 18(3), 248–254 (2003) 19. Fang, D.B., Wang, X.J., Zhang, Y.X., Liu, K., Wang, P., Zhu, Z.Y.: A Double auction Bayesian model with supplier and demander in open bilateral electricity market. Power System Technology 27(12) (2003) (in Chinese) 20. Gibbons, R.: A Primer in Game Theory. Pearson Higher Education, London (1992) 21. Wilkening, J., Wilke, A., Desai, N., Meyer, F.: Using clouds for metagenomics: A case study. In: IEEE International Conference on Cluster Computing and Workshops CLUSTER (2009)
NPA-BT: A Network Performance Aware BitTorrent Traffic Optimization Mechanism Changyou Xing, Li Yang, and Ming Chen Institute of Command Automation, PLA University of Science and Technology Nanjing, Jiangsu, China, 210007 {xcy,yl,cm}@plaust.edu.cn
Abstract. More than half of today’s Internet traffic is generated by P2P applications, and most of which is the BitTorrent file sharing traffic. The random neighbor selection mechanism of BT puts unprecedented pressure on the network operators and service providers. In this paper firstly we give a theoretical analysis on the impact of neighbor selection mechanism on BitTorrent performance as well as network resource cost. And then based on network performance prediction techniques, we propose a network performance aware BitTorrent traffic optimization mechanism NPA-BT, which uses network performance information to help overlay constructing. Simulation results show that NPA-BT can select peers in the same AS network and with good performance as neighbors, and thus the BitTorrent download time and the cross AS network traffic are all decreased. Keywords: Network performance aware, traffic optimization, BitTorrent Modeling.
1 Introduction Peer-to-Peer (P2P) applications have become immensely popular in the Internet. Network measurements shows that P2P traffic accounts for more than half of the Internet traffic [1]. Among P2P applications, file sharing is perhaps the most popular application. Nodes in P2P system construct their own application layer overlay network. However, most of current P2P systems ignore the performance and structure of under layer physical network, and two neighbor nodes in the overlay network may be far from each other in the physical network. Recent studies showed that the routing mechanisms in P2P overlay networks always violate ISP routing policies, and average P2P bit traverses 1000 miles and 5.5 metro-hops [2]. 50%-90% of existing local pieces in active users are downloaded externally [3]. For one thing, these problems make P2P applications waste network resources and increase the difficulty of ISP network engineering; for the other, these problems also decrease the P2P application performance itself. One of the key reason for such a problem is that the transparency of network performance to upper P2P applications. Internet provides best effort packet delivery
service to applications, with no network performance information to its applications. Due to the overload problems, it is also impossible to get the network performance using active measurement. In the former works, we proposed network delay and available bandwidth prediction models NetPharos [4] and PathGuru [5], which can predict network performance based on partial measurement. Using the information provided by them, we can help P2P applications choosing peers in the same ISP networks and with good performance as neighbors, so as to increase P2P application performance and decrease cross ISP network traffic. Based on this idea, we propose a network performance aware P2P traffic optimization mechanism named NPA-BT, which uses the AS and performance information to construct application layer overlay network. Since BitTorrent traffic is more than half of today’s P2P traffic, we use BitTorrent as our analysis object, but the analysis method is also suitable to other P2P applications. The rest of this paper is organized as follows: Section 2 presents a short survey of related works on P2P application layer traffic optimization. Section 3 models the BitTorrent file downloading process from different point of view, and analyzes the relationship between nodes performance and their AS networks, and provides the network performance aware BitTorrent traffic optimization mechanism NPA-BT. Section 4 evaluates the performance of NPA-BT by simulation. Finally, Section 5 summarizes our work.
2 Related Works Essentially, P2P traffic optimization is solving the mismatch problem between overlay network and physical network. The P4P [6] research project has developed a framework which ISPs can use to convey network information to P2P applications. The framework of the P4P project is based on two main architectural entities: the itracker and the p-distance. With a thorough measurement analysis, [3] showed that regular BitTorrent is totally network-unaware, and they proposed that ISPs intercept P2P-traffic at edge routers and redirect them to P2P-clients within the same ISP. However, they do not investigate such a solution in detail. [7] proposed to improve P2P-locality through network layer topology information, and they gave a biased neighbor selection mechanism, but it needs to get ISP network topology information first, which is hard to get, if not impossible. [8] presented an oracle based ISP and P2P application collaboration mechanism, in which each ISP maintain an oracle server, P2P client can query the oracle to get information about the underlying network, and then it can select node with the best performance as its neighbors. [9] presented an approach to reducing costly cross ISP traffic by using CDN redirection information. Yu presented a hierarchical architecture CBT (Clustered BitTorrent) for grouping BitTorrent-like P2P applications peers in proximity, However they also only focused on network proximity metric, and did not take network download available bandwidth into consideration [10]. For P2P file sharing applications such BitTorrent, the most concern thing is how to decrease file distribution time; but for ISP, the most concern thing is how to decrease cross ISP network traffic and network resource utilization. Unfortunately, most of
NPA-BT: A Network Performance Aware BitTorrent Traffic Optimization Mechanism
167
current traffic optimization mechanisms only focus on one of the two aspects, and cannot achieve the two aims simultaneously.
3 Network Performance Aware P2P Traffic Optimization Mechanism NPA-BT In this section, we will firstly analyze the impact of neighbor nodes selection mechanism on resource download time, discuss the optimization object from different point of view, and then we study on relationships between nodes distances and AS networks. Finally we propose a network performance aware P2P traffic optimization mechanism NPA-BT. 3.1 Theoretical Analysis of BitTorrent Neighbor Selection Mechanism 3.1.1 BitTorrent File Download Time Model Suppose each peer node in BitTorrent system selects n nodes randomly to download data, and the basic parameters are shown as follows:
x(t ) : number of lechers at time t; y (t ) : number of seeds at time t;
λ : arrival rate of new peers; μ : upload rate of peers; c : download rate of peers;
θ : abort rate of download peers; γ : leave rate of seed in the system; η : file share efficiency, most time close to 1. Suppose each peer selects k peers and (n-k) seeds to download data, and μ is the average upload rate of these n nodes. Using the Markov chain based fluid math model, we can give the variation rate of download nodes and seeds as follows:
Most times the download rate of a node is much larger that its upload rate, thus we can ignore the restriction of upload rate.
min{cx(t ), kμηx(t ) + (n − k ) μy (t )} = kμηx(t ) + (n − k ) μy (t ) (3) To simplify the model, we suppose there are no seeds in the system. That is, each node will quit the system as soon as it completes the file download process. Then equation (1) can be simplified as follows:
dx(t ) = λ − θx(t ) − nμηx(t ) dt
(4)
168
C. Xing, L. Yang, and M. Chen
Here we mainly pay our attention to the performance analysis of system in the _
stabilization state, lim x(t ) = x , and t →∞
dx(t ) = 0 , thus dt _
_
0 = λ − θ x − nμη x
(5)
By solving equation (5) we can get the following result.
λ
_
x=
(6)
θ + nμη
Little law can be used to estimate the average download time of each node during the system stable state, and by using this law we can get equation (7) _
_ λ −θ x _ x = (λ − θ x)T λ
(7) _
In which T represents the average download time, and λ − θ x represents average download rate. From equation (7) we can work out the result of the average download time T. _
T=
x
λ
=
1 θ + nμη
(8)
Generally we suppose η = 1 approximately, and the variation of θ on the system can also be ignored. Then from equation (8) we can see that in BitTorrent file sharing system, the download time is determined by the value of nμ . By choosing nodes with high upload rate as neighbors, the download time can be decreased dramatically. Thus, in order to optimize the BitTorrent file download time, we should make each peer to select peers with the highest upload available bandwidth as neighbors. 3.1.2 Network Resource Optimization Model The upper model mainly focuses on how to decrease the download time of BT peers, and it takes no consideration on network resource usage information during data download period. For example, if the data is downloaded from different ISP networks, the ISP should pay for the cross IPS network traffic. And even in the same ISP, selecting a close peer as neighbor will also save network resources. Thus, we will extend the upper model, and take both ISP network resource and BT download rate optimization into consideration. Suppose Bij (t ) represents the download rate of peer Hi from peer Hj at time t; Di (t ) represents the download rate restriction of peer Hi at time t; U j (t ) represents the upload rate restriction of peer Hi at time t; Lij (t ) represents the distance cost from peer Hi to peer Hj at time t; and Wij (t ) represents the trans-AS cost from peer Hi to peer Hj at time t. We also suppose that a peer will leave the network as soon as it finishes the download process, and there are n nodes in the system. From the viewpoint of P2P performance optimization, we want to maximize the download rate of the n nodes at any time t, which can be represented as follows.
NPA-BT: A Network Performance Aware BitTorrent Traffic Optimization Mechanism n
max ∑
n
∑B
i =1 j =1, j ≠ i
n
∑B
s.t.
j =1, j ≠ i n
∑B
i =1,i ≠ j
ij
ij
169
ij
(t )
(t ) ≤ Di (t )
(t ) ≤ U j (t )
(9)
∀H i ∀H j
Bij (t ), Di (t ),U i (t ) ≥ 0
∀H i , H j
On the other hand, from the viewpoint of ISP network resource optimization, we want to minimize the total network resource cost at any time t. n
min ∑
n
∑ (L
ij
i =1 j =1, j ≠ i
s.t. Lij (t ) > 0,
(t ) + Wij (t )) Bij (t ) ∀H i , H j , i ≠ j
⎧⎪m(t ) Wij (t ) = ⎨ ⎪⎩0
(10)
,
H i and H j in different AS m(t ) > 0 H i and H j in the same AS
The upper expression describes the optimization objects at each time t. Here we take the full data download period into consideration, and suppose Ti and Ci represent the time and cost of peer Hi needed to download a file separately. Generally, the BitTorrent peer downloads resources from different neighbors simultaneously. Suppose Hi downloads M ij ( M ij ≥ 0) data from Hj, and then the finish time of Hi is determined by the maximum download time from the n-1 peers, that is
⎛M ⎞ Ti = max⎜ ij ⎟ j ⎜ B ⎟ ⎝ ij ⎠
(11)
Without loss of generality, we suppose the size of downloaded file is 1, and there is no replicated download problem, then the total number of data downloaded by each node Hi must be 1, that is n
T
j =1
0
∑∫
Bij (t )dt = 1
(12)
To simplify the model, we ignore the limit of Di. That is, the download capability of a peer is large enough, so that the download rate of a node is the sum of all upload rates it gets from other nodes. Under this constraint, the best download schedule mechanism is that Hi partitions data into n-1 blocks and download them from n-1 neighbors separately, and the size of each block should be consistent with the upload rate of the corresponding peer, so that Hi can finish the n-1 download simultaneously. The worst download schedule mechanism is that Hi downloads the whole data from a node with the smallest upload bandwidth. Thus the download finish time of node Hi is
1 1 ≤ Ti ≤ min Bij ∑ Bij j
j
(13)
170
C. Xing, L. Yang, and M. Chen
On the other hand, the network resources Hi costs to complete its data download is determined by its download bandwidth, data transfer distances and the cross AS network cost. n
Ti
j =1
0
Ci = ∑ ∫ ( Lij (t ) + Wij (t )) Bij (t ) dt
(14)
Using the result of expression (12), we can get
min ( Lij ) ≤ Ci ≤ max ( Lij + Wij ) j
j
(15)
Thus, from the viewpoint of resource download, P2P application wants to minimize the total download time, and ISP wants to minimize network resource cost. The two optimization object can be represented as follows. n n ⎧ ⎛ M ij ⎞ ⎟ ⎪min ∑ Ti = min ∑ max ⎜ j ⎜ B ⎟ ⎪ i =1 i =1 ⎝ ij ⎠ ⎨ n n n Ti ⎪ ⎪min ∑ Ci = min ∑ ( ∑ ∫0 ( Lij (t ) + Wij (t )) Bij (t )dt ) i =1 j =1, j ≠ i i =1 ⎩
(16)
Taking equation (13), (15) and (16) together, we can see that the optimization of P2P application download time is determined by choosing high available bandwidth nodes, which is consistent with our upper model. But for the ISPs to optimize network resource utilization, the distances between neighbors and the cross AS traffic should be decreased. Thus to achieve this aim, the BitTorrent clients should select nodes in the same ISP network or close networks as neighbors. Besides, links between different ASes always form bottlenecks, thus selecting nodes in the same AS network as neighbors will also provide high download rate, and the two optimization objects can be satisfied simultaneously. Based on this discussion, we will propose the network performance aware BitTorrent traffic optimization mechanism NPA-BT in the next section. 3.2 Network Performance Aware BitTorrent Traffic Optimization Mechanism NPA-BT As discussed above, to optimize BitTorrent traffic, the neighbor selection mechanism should be modified. Each peer should select nodes in the same AS network or close to it, and have high bandwidth as its neighbors. An intuitive idea is firstly using NetPharos and PathGuru to predict the distances and available bandwidths between candidate peers, and then selecting nodes with the lower delay and higher bandwidth as neighbors. But an open question is whether nodes with short distances are in the same AS network? To answer this question, we firstly give an analysis on a measurement dataset. Ledlie uses a typical BitTorrent client Azureus to measure distances from 283 PlanetLab nodes to 156658 Azureus peers running all around the world. The measurement dataset includes about 9.5 x 107 items, and each item includes node IP address and distances [11]. We use this dataset to study the nodes distances and AS relationships. Firstly we use RouteView [12] information to find the AS number and
NPA-BT: A Network Performance Aware BitTorrent Traffic Optimization Mechanism
171
country information of each IP address, and then estimate the relationship between distance and AS. Fig. 1 shows the AS distribution of nodes with different distances, from which we can see that there is no direct relationship between nodes distance and their AS networks, and nodes with small distances may locate in different ASes. For example, nodes that have distances less than 2 ms are located in more than 5 ASes, which means that when using distances as the criteria to select neighbors, we cannot achieve the aim of decreasing inter AS traffic. 25
Number of Ases
20 15 10 5 0 1
2
3
4
5 6 Delay (ms)
7
8
9
10
Fig. 1. Number of AS distribution with different delays
Besides, we also investigate the distance variation when selecting nodes in the same AS or the same country as neighbors. In mathematics, variance is used to describe the scattering degree of a variable, but this metric is related to the quantity of the variable, and thus we use the ratio of standard deviation and expectation to eliminate such impact. If the expectation of a variable X is E ( X ) , and its variance is D ( X ) , we define its standard deviation and expectation ratio as
DE ( X ) =
D( X )
. DE ( X ) = 0 means the value of X has no fluctuation at E( X ) all. Otherwise the larger DE ( X ) is, the more serious X fluctuates. Fig. 2 shows the standard deviation and expectation ratio of nodes distances in the same AS or the same country, from which we can see that no matter which kind is considered, the distances fluctuate seriously. This result means if we only use in the same country or AS network as neighbor selection criteria, we can reduce inter AS traffic, but we cannot guarantee to select nodes with short distances. From the upper discussion we can see that only using distances or AS relationships as neighbor selection criteria has limitations, and thus we combine the network performance information and AS information during neighbor selection. Firstly we find nodes in the same AS network as candidates, and then select nodes that have small distances and high available bandwidth from the candidates as neighbors. Besides, though the random neighbor selection mechanism of BitTorrent has the low efficiency problem, it can enhance the system robustness. If nodes all select peers in its AS network as neighbors, the BitTorrent overlay may be isolated. Wang showed that resources in a BitTorrent overlay network have typical region feature. For example, videos with a special language
172
C. Xing, L. Yang, and M. Chen
are only popular in regions that use that language as mother tongue, and the resources are rare in other regions [13]. Thus if the number of peers in an AS network is very small, we still use random candidate neighbor selection mechanism, so as to keep the connectivity of overlay network. The algorithm is shown as follows. 1 0.9
Cumulative Distribution
0.8 0.7 0.6 0.5 0.4 0.3 0.2 Delay variance ratio to nodes in a country Delay variance ratio to nodes in a AS
0.1 0
0
0.5
1
1.5 2 2.5 Delay variance ratio
3
3.5
4
Fig. 2. Standard deviation and expectation ratio of nodes distances in the same AS or the same country
Algorithm: NPA-BT //Select peers in the same AS to form Candidate Begin Get peer_list from tracker; For each peeri in peer_list Begin If ASi = = ASlocal Add peeri into Candidate; End // num < threshold means the number of peers in this // AS is too small. Only select peers with good // performance as neighbors num := sizeof (Candidate); If (num < threshold) Begin Select threshold peers randomly from peer_list; Add the threshold peers to Candidate; End // Select good performance peers in candidate For each peeri in Candidate Begin distancei := NetPharos (local, peeri); avbi := PathGuru (local, peeri); perfi := avbi/distancei; Sort Candidate by perfi; Return the first k nodes in Candidate; End End
NPA-BT: A Network Performance Aware BitTorrent Traffic Optimization Mechanism
173
The algorithm of NPA-BT includes two main parts: finding candidates and selecting neighbors. During the finding candidates period, we try to find peers in the same AS network as candidate neighbors, but if the number of peers in an AS network is too small (less than predefined threshold), we will use random peer selection mechanism to add some peers to the candidates. During the neighbor selection period, we firstly use NetPharos and PathGuru to compute the distance and available bandwidth metrics, and use their ratio to represent the performance of a peer, and finally select the first k nodes with good performance.
4 Performance Evaluation In this section, we evaluate the performance of NPA-BT by simulation, and the simulation parameter is similar to [14]. Firstly a two hierarchical network topology with 300 routers is generated using BRITE, which includes 32 backbone routers, and they only provide traffic forwarding service. Others are stub routers, which provide network access service to end users. 1000 end user nodes are connected to the stub routes randomly, and the access bandwidth is distributed uniformly between 1~9 Mbps. The distance of each link in the stub network is distributed uniformly between [1, 3], and the distance of each link between stub router and transit router is distributed uniformly between [10, 15], and the distance of each link in the transit network is distributed uniformly between [20, 90]. For the upper network topology, we use NetPharos and PathGuru to predict the distances and available bandwidths between every two end nodes. And during the simulation each end user generates 100 download requests randomly, and each resource has 30 replicas distributed randomly in the 1000 end nodes. We use both random neighbor selection and NPA-BT to construct, and compare the average performance between neighbor nodes in the overlay networks. 1 0.9 0.8 0.7
CDF
0.6 0.5 0.4 0.3 0.2 NPA-BT Original BT
0.1 0
0
20
40
60 80 100 120 140 RTT Between Neighbors (ms)
160
180
Fig. 3. Cumulative distribution of RTT between neighbors
Fig. 3 and Fig. 4 show the cumulative distribution of average distances and available bandwidth when using the two peer selection mechanism. From which we can see that with the help of network performance information, NPA-BT can select
174
C. Xing, L. Yang, and M. Chen
the more efficient nodes as neighbors, and decreases the distances, increases available bandwidths between neighbors. Thus we can achieve the optimization object of decrease download time and cross AS network traffic by using NPA-BT. 1 0.9 0.8 0.7
CDF
0.6 0.5 0.4 0.3 0.2 NPA-BT Original BT
0.1 0
1
2 3 4 5 6 7 8 9 Available Bandwidth Between Neighbors (Mbps)
10
Fig. 4. Cumulative distribution of available bandwidth between neighbors
To improve the object download efficiency, many P2P applications will select more than 1 node as peers to download data at the same time. Fig. 5 and Fig. 6 show the distance and available bandwidth comparison of selecting peers with and without network performance information. From these two results we can see that no matter which metric is considered, neighbors selected by NPA-BT are better than that selected randomly in original BT. 120
NPA-BT
Original BT
100
RTT (ms)
80 60 40 20 0 1
2
3 4 Number of peers
5
6
Fig. 5. Relationship between average RTT and number of neighbors
Fig. 5 and Fig. 6 also show that as the number of neighbors increases, the average distance of peers selected by NPA-BT also increases accordingly (the average available bandwidth decreases accordingly), but the average distance or available bandwidth of peers selected randomly nearly keeps to be the same. This phenomenon is because NPA-BT always selects the best peers as neighbors. As the number of
NPA-BT: A Network Performance Aware BitTorrent Traffic Optimization Mechanism
175
neighbors increases, some sub-optimal nodes must be selected as neighbors, and thus the average performance will decrease accordingly. But it still much larger than that of random peer selection mechanism. 9 Average available bandwidth(Mbps)
NPA-BT 8 Original BT 7 6 5 4 3 2 1 0 1
2
3 4 Number of peers
5
6
Fig. 6. Relationship between average available bandwidth and number of neighbors
Taking all the above analysis into consideration, we can see that NPA-BT can select the best peers as neighbors, which increases the performance of P2P applications as well as decreases the cross ISP network traffic. As the emergence of new applications, how to provide network status information to them so as to optimize their performance becomes more and more important. And this method can also be extended to solve other distributed application optimization problems.
5 Conclusions More than half of current Internet traffic is generated by P2P applications, and most of which is BitTorrent traffic. Thus how to optimize P2P traffic is considered both by ISP and Internet researchers. In this paper we give a analysis on the impact of neighbor selection on BitTorrent performance and network resource consumption, and then propose a network aware BitTorrent traffic optimization mechanism NPABT. Simulation results show that with network status information, NPA-BT can select the best peers as neighbors, and thus improve the BT performance as well as decrease the inter AS traffic. In the future work, we are going to deploy the NPA-BT on PlanetLab so as to test its performance. Since the significance of P2P traffic optimization on both ISP and P2P applications, this research will be useful in providing a mechanism for BitTorrent traffic optimization.
References [1] IPOQUE. Internet Study 2007: Data about P2P, VoIP, Skype, file hosters like RapidShare and streaming services like YouTube (November 2007), http://www.ipoque.com/media/internet_studies/ internet_study_2007 [2] Seetharaman, S., Ammar, M.: Characterizing and mitigating inter-domain policy violations in overlay routes. In: Proc. of IEEE ICNP (2006)
176
C. Xing, L. Yang, and M. Chen
[3] Karagiannis, T., Rodriguez, P., Papagiannaki, K.: Should internet service providers fear peer-assisted content distribution? In: Proc. of the ACM SIGCOMM Internet Measurement Conference 2005, Berkeley, CA (October 2005) [4] Xing, C., Chen, M.: A Virtual Node based Network Distance Prediction Mechanism. In: Proceedings of the IEEE GlobeCom 2008 (2008) [5] Xing, C., Chen, M., Yang, L.: Predicting Available Bandwidth of Internet Path with Ultra Metric Space-based Approaches. In: Proceedings of the IEEE GlobeCom 2009 (2009) [6] Xie, H., Yang, Y., Krishnamurthy, A., Liu, Y., Silberschatz, A.: P4P: Provider Portal for Applications. In: Proceedings of the ACM SIGCOMM 2008, pp. 351–362. ACM Press, New York (2008) [7] Bindal, R., Cao, P., Chan, W., et al.: Improving traffic locality in BitTorrent via biased neighbor selection. In: Proc. of IEEE ICDCS 2006, Lisboa, Portugal (July 2006) [8] Aggarwal, V., Feldmann, A., Scheideler, C.: Can ISPs and P2P Systems Co-operate for Improved Performance? ACM SIGCOMM Computer Communications Review 37(3), 29–40 (2007) [9] Choffnes, D., Bustamante, F.: Taming the Torrent: A practical approach to reducing cross-isp traffic in peer-to-peer systems. In: Proceedings of the ACM SIGCOMM 2008, pp. 363–374. ACM Press, New York (2008) [10] Yu, J., Li, M.: CBT: A proximity-aware peer clustering system in large-scale BitTorrentlike peer-to-peer networks. Computer Communications 31, 591–602 (2008) [11] Network Coordinate Research at Harvard, http://www.eecs.harvard.edu/~syrah/nc/ [12] Route Views Project, http://www.routeviews.org/ [13] Wang, H., Liu, J., Xu, K.: On the Locality of BitTorrent-based Video File Swarming. In: Proc. of IPTPS 2009, Boston, MA (2009) [14] Shen, G., Wang, Y., Xiong, Y., et al.: HPTP: Relieving the Tension between ISPs and P2P. In: Proc. of ACM IPTPS 2007, Bellevue, WA (Feburary 2007)
User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors Jiqiang Song1, Eugene Y. Tang1, and Leibo Liu2 1 Intel Labs China, Beijing, China {jiqiang.song,eugene.y.tang}@intel.com 2 Tsinghua University, Beijing, China [email protected]
Abstract. More and more mobile phones are equipped with multiple sensors today. This creates a new opportunity to analyze users’ daily behaviors and evolve mobile phones into truly intelligent personal devices, which provide accurate context-adaptive and individualized services. This paper proposed a MAST (Movement, Action, and Situation over Time) model to explore along this direction and identified key technologies required. The sensing results gathered from some mobile phone sensors were presented to demonstrate the feasibility. To enable always sensing while reducing power consumption for mobile phones, an independent sensor subsystem and a phone-cloud collaboration model were proposed. This paper also listed typical usage models powered by mobile phone sensor based user behavior prediction. Keywords: Mobile computing, sensor, user behavior analysis, pattern prediction, MAST, context-adaptive, individualized.
characterize human behaviors, and provides a probabilistic method to analyze and predict user behavior patterns. Key technologies required for this research are discussed in Section 4. Finally, we list three typical usage models based on user behavior pattern analysis and prediction using mobile phone sensors.
2 Related Work User behavior analysis is not a new topic. Plenty of research work [4-6] has been done by analyzing applications and contents that users interact with. However, these analysis tools are either embedded in applications or in the service provider’s servers, which are not taking into account the new and powerful sensing capability of next generation mobile phones. Some work on wearable sensors [7] utilized multiple sensors, but assuming sensors are worn on several positions of human body, which is very different from the usage model of mobile phones. On the other hand, existing research on sensors mainly focuses on the environmental infrastructure buildup and inter-device communications, such as ubiquitous computing [8], pervasive computing [9], and wireless sensor network [10]. Different from the above research, the work presented in this paper is mobile phone centric, and will be complementary with the advancement of above research and form synergy for mobile computing.
3 MAST Model 3.1 Concept This section proposes a simple model for mobile phone user behavior analysis. The basis of user behavior analysis is to understand human activities and link them to other factors. In general, human daily activities can be classified into movements and actions according to their effects. First, we give their definitions in our context. z z
A movement is a salient physical activity which causes location change or pose change of human body, e.g., sit down, walk, run, etc. An action is a subtle physical activity which generates impact to the rest of the world other than location change or pose change of human body, e.g., talk, press on a key, tap on a screen, etc.
There are two important factors related to human activities, time and situation. Time is the natural time perceivable by human, measured by seconds. Situation is the location where human activities happen and the environment that affects human activities. Note that the meaning of location in the situation defined here is not the same as the precise geographic location which is measured by longitude and latitude. It is what human regard as a “place”, such as office, home, bus, etc.
User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors
179
Movements and actions have different relations with time and situation. Movements usually take measurable time, and it may bring human from one situation to another, while actions take variable time (very short for pressing a key or pretty long for watching a video), and it does not lead to situation changes. Figure 1 shows an example of recording a person’s behavior over a short period of time, where Si (i=1..3) denote different situations, Mi (i=1..4) denote different movements, and Ai (i=1..5) denote different actions.
Fig. 1. Temporal record of a person’s behavior over time
From Figure 1, one can clearly determine where the person is and what the person is doing at any particular time. We call this the MAST (Movement, Action, Situation over Time) model. M(t), A(t) and S(t) denote the value of movement, action and situation at time t, respectively. 3.2 Probabilistic Model If we use one day as the analyzing period for human behavior, M(t), A(t) and S(t) may be different at the same t of different dates. Therefore, M(t), A(t) and S(t) are not scalar values, but vectors containing probabilities of possible values. For example, suppose the possible movements of a person at a particular time are Mi (i=1..n), M(t) = [P1, P2, .., Pn], where Pi is the probability of Mi happening at that time, and Pi =1.0. Same for A(t) and S(t). M(t), A(t) and S(t) are denoted as column vectors. M(t), A(t) and S(t) can be independently observed over time and over days. Since a person’s daily life does show certain patterns due to his/her occupation and habits, M(t), A(t) and S(t) will converge after a certain period. They may converge at more than one peaks, e.g., corresponding to weekday pattern, weekend pattern, etc. This will be useful to predict a person’s movement, action or situation by time. However, the prediction is sensitive to the possible time shift of human activity. On the contrary, the correlations among movements, actions and situations are less time sensitive, so that they are more robust to characterize human behavior patterns. Figure 2 shows the probabilistic relations among M(t), A(t) and S(t). In short, when a user is observed in
∑
180
J. Song, E.Y. Tang, and L. Liu
Δ
Δ
state M(t), A(t) or S(t) at time t, how possible he/she will be in state M(t+ t), A(t+ t) or S(t+ t) after time t, where t is the time interval between observation and prediction.
Δ
Δ
Δ
Fig. 2. Probabilistic relation model among M(t), A(t) and S(t)
Below are the physical meanings of probabilities defined in Figure 2, where i, j are indexes of possible actions, movements or situations. z z z z z z z z z
Δ Δ Δ Δ Δ Δ Δ Δ Δ
PΔt(A|A) indicates when Ai happens at t, the probability of Aj happening at (t + t). PΔt(M|A) indicates when Ai happens at t, the probability of Mj happening at (t + t). PΔt(S|A) indicates when Ai happens at t, the probability of being Sj at (t + t). PΔt(M|M) indicates when Mi happens at t, the probability of Mj happening at (t + t). PΔt(A|M) indicates when Mi happens at t, the probability of Aj happening at (t + t). PΔt(S|M) indicates when Mi happens at t, the probability of being Sj at (t + t). PΔt(S|S) indicates when being Si at t, the probability of being Sj at (t + t). PΔt(A|S) indicates when being Si at t, the probability of Aj happening at (t + t). PΔt(M|S) indicates when being Si at t, the probability of Mj happening at (t + t).
Therefore, for a specified time t, PΔt(Y|X) is a two-dimension probability matrix, which is trained by day-to-day observations of user activities. Its row dimension equals to the dimension of X(t), and its column dimension equals that of Y(t). For any given Xi detected at time t, PΔt(Y|X) stores the probability of each Yj happening after t. The minor time shifts of user activities should be considered during training PΔt(Y|X). When X(t) or Y(t) is updated with new observation data, PΔt(Y|X) should be updated as well.
Δ
User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors
181
With the converged MAST probability model trained from user activities over many days, one can extract many interesting user behavior patterns by time-movement correlation, time-action correlation, time-situation correlation, movement-movement correlation, movement-action correlation, etc. It is also interesting to extract time-related correlations and time-unrelated correlations. 3.3 MAST-Based Behavior Prediction The MAST probabilistic model described in the above section is designed for characterizing a user’s daily behavior pattern. As long as the model has been trained to converge, it is very useful to predict the user’s next activity based on his/her current activity being detected. For example, assuming time is always a known value, below is a basic flow for predicting action by movement. i. ii. iii. iv. v. vi.
Detect a human movement, denoted by Mi; Using the current time t as a starting point, search the neighborhood of M(t) for a local probability peak for Mi; If the local peak is larger than a threshold, a strong prediction can be made; otherwise, a weak prediction is made; Get the time t’ corresponding to the local peak; Using t’ as the starting point, search the neighborhood of PΔt(A|M) for a local probability peak for Mi; Take the Aj corresponding to the local peak as the prediction.
The prediction flows for other input/output requirements are similar. One can also further develop to use more than one input to predict more precisely.
4 Key Technology There are many methods, such as Bayesian-based methods [11] and HMM-based methods [12], available for training the MAST probabilistic model, so that we do not repeat them in this paper. This section describes the other key technologies required for user behavior pattern analysis and prediction based on mobile phone sensors. They include user activity detection, independent sensor subsystem, and phone-cloud collaboration. 4.1 User Activity Detection When using sensors embedded in a mobile phone to collect its user’s activity data in a non-intrusive way, the detectable movements and situations are not as many as those detectable by surrounding sensors or wearable sensors. On the other hand, not all human movements and actions are concerned for user behavior analysis. Table 1 shows some concerned movements, actions and situations.
182
J. Song, E.Y. Tang, and L. Liu
Table 1. Concerned movements, actions and situations
Movements Sit down Stand up Walk on flat Walk upstairs Walk downstairs Take lift up Take lift down Run Drive
Actions Phone call Read news Web search Read mail Write mail Take photo Listen music Watch video Visit SNS website
Situations Home Office Car Bus Subway Street Canteen Meeting room Playfield
Among several sensors on a mobile phone (e.g., Nexus One), we identified that the 3-axis accelerometer and the 3-axis orientation sensor (a.k.a. digital compass) played important roles in detecting movements. Figure 3 shows the definition of 3D output of these two sensors, where G(i) means gravity acceleration on axis i.
Fig. 3. 3D output definitions of accelerometer and orientation sensor on a phone
Figure 4 shows the accelerometer data graphs over time for some typical movements, which demonstrate salient features. In the figure, the horizontal axis is time, and the vertical axis is G(i). Since the accelerometer data are 3-dimension vectors, 2-dimension and 3-dimension graphics pattern recognition method [13] is employed to detect movements. Orientation sensor data are used to determine the orientation and pose of phone, which does not necessary indicate its user’s orientation because there are different positions that the phone may be placed on human body, e.g., shirt pocket, coat pocket, trousers pocket or bag. When the user is not moving, it is impossible to determine the user’s orientation from orientation sensor output. Fortunately, we observed that when the user is doing the same walking movement on flat ground, the accelerometer sensor
User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors
183
a. G(x) data for walking on flat
b. G(x) data for walking upstairs
c. G(z) data for taking lift up Fig. 4. Accelerometer data graphs for some typical movements
output has different features for different phone poses, so that one can derive the relative position of phone on its user’s body. The user orientation detection algorithm may fuse the accelerometer output and orientation sensor output to determine user’s current
184
J. Song, E.Y. Tang, and L. Liu
orientation. Table 2 shows some results of user walking orientation detection using sensor output fusion. In the table, the accelerometer output, G*(i), is the averaged pace-wise sum of gravity acceleration values on each axis for walking, and the granularity of fused user orientation is 15⁰ in the 0⁰~360⁰ scope. From the result, we can see that when the phone moves with user at different poses, e.g., holding phone in hand to read messages so that the phone is heading front and facing up, the feature (G*(x), G*(y), G*(z)) varies much, and the fused user orientation can be significantly different from the original “Azimuth” output. Table 2. Sensor output fusion for user orientation detection Phone pose
Accelerometer output
Orientation sensor output
Fused user
Heading
Facing
G (x)
G (y)
G (z)
Azimuth
Pitch
Roll
orientation
Front
Up
-0.8
101.1
301.4
287⁰
-18⁰
0⁰
285⁰
Up
Back
49.7
308.9
-39.3
231⁰
-99⁰
9⁰
240⁰
Down
Front
56.2
-299.9
58.6
43⁰
75⁰
10⁰
75⁰
Back
Down
-18.0
-60.1
-297.9
102⁰
152⁰
-4⁰
255⁰
*
*
*
All concerned actions listed in Table 1 are through a user’s interaction with the applications installed in a mobile phone. These actions cannot be detected by sensors, but can be detected by software approach. We used an application manager which monitors the execution status of applications, such as start, active, inactive, and stop. Therefore, the user actions can be recorded by the virtual desktop. For the situation detection, the microphone sensor is most useful. There are some pretty good methods [14] can be used to detect situations by audio only. The fusion with other sensors, such as accelerometer, will further clear the ambiguity between similar audio scenes, like canteen and playfield, since the body movements are quite different. 4.2 Independent Sensor Subsystem Unlike other wearable sensor based user behavior analysis approach, we must consider the limitation imposed by mobile phone sensors. The most critical limitation is that mobile phone is a battery-powered small device. Detecting a user’s movement requires always sensing in the background, which means sensors are working even the user does not use the phone or phone is in sleep mode. This will pose a severe power consumption issue, since sensor data need CPU processing in today’s mobile phone SoC (System on Chip) architecture, so that CPU cannot sleep when sensors are working. Figure 5 shows a typical mobile phone SoC architecture.
User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors
185
Fig. 5. Current mobile phone SoC architecture
In details, following reasons make the architecture in Figure 5 not suitable for always sensing. i. ii.
iii.
Sensors’ always-on traffic characteristic is different from other low-speed peripherals; There are two levels of buses to separate high-speed and low-speed traffics to alleviate burden on main bus. However, sensor data need to travel upwards to be buffered and processed by CPU. It does not make sense either to keep the high-speed bus on just waiting for the low-speed sensor data; Furthermore, there is no direct communication between sensors, which prevents sensors from autonomous inter-control and collaboration.
To save the power consumption for always sensing, we propose to design an independent sensor subsystem in mobile phone SoC to resolve the above problems, as shown in Figure 6. This design isolates the core network from the sensor network. It not only allows CPU and main bus to sleep while one or more sensor subsystems are working, but also adds intelligence to sensor peripheral for inter-sensor communication, sensor data manipulation, and sensor data fusion. Isolation of the core network from the sensor network mainly occurs in two different aspects – power and clocking. There are well-known techniques in IC (Integrated Circuit) design for creating different power islands and clock sub-networks, so they will not be repeated here. Adding intelligence to the sensor network is more complicated. On the one hand, we do not want to add a processing block to each sensor, as it creates significant overhead. On the other hand, it is not a good idea to have one processing block for all sensors, as this will cause a significant burden on the routing and processing capability. Instead, what we are proposing is a network of sensor subsystems, where each subsystem has its own processing unit. The size of each subsystem depends on the sensor processing requirements, size of chip and capability of the processing unit. We can employ power saving techniques to such a network of sensor subsystems by further dividing the sensor subsystems into sub-groups.
186
J. Song, E.Y. Tang, and L. Liu
Fig. 6. Independent sensor subsystem
4.3 Phone-Cloud Collaboration It’s well known that probabilistic model training is computation intensive. Mobile phone itself is not an ideal platform to do that, unless there is specific hardware accelerator for that, which is not true nowadays. Therefore, how to leverage the network side capability is essential. With the fast development of wireless communication, always-connected mobile phones will become mainstream and the average available bandwidth for each user will exceed 1Mbps at the 4G era. In this circumstance, cloud computing and storage service [15, 16] is going to be the most easy-to-access and always-available computation and storage resource for mobile phones. We propose a phone-cloud collaboration model to fully exploit the advantages of both sides, as shown in Figure 7. In this collaboration model, the phone acts like a sensory organ, while the cloud acts like a brain. The cloud must emphasize privacy management, so that its users feel like they are using a private computer. The MAST probabilistic model for a user is stored in the cloud side, so that switching phones does not impact behavior analysis. The cloud provides standard interface to interact with phones. The phone uses its sensors and/or software to detect its user’s movements, actions and situations. The detection results are sent to the cloud for two purposes. One is to train the MAST probabilistic model for a specific user; the other is to serve as input to predict the user’s next behavior. With this partition, the communication data rate is very low since only results are transmitted.
User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors
187
Fig. 7. Phone-cloud collaboration model
5 Usage Model Mobile phone sensor based user behavior pattern analysis and prediction is the foundation of evolving mobile phones from a life-enriching device to an intelligent user behavior assistant. This section depicts three usage models powered by this technology. 5.1 Proactive UI Adaptation This usage model is between user and phone. Today’s mobile phone contains many applications that cannot be easily navigated on the small screen. Even though there are many UI innovations on mobile phones, but they still cannot fulfill individual’s need. From the user behavior analysis, the user pattern about using what application at what time, what situation, and after what movement, what action is known. With the real-time user behavior prediction result, the phone’s UI can be reorganized to adapt to the next action or situation, so that the user will always see the application he/she wants to use on the first screen. 5.2 Proactive Situation Adaptation This usage model is between user and situation. There are two aspects: one is that the phone adapts to the situation, the other is that the site adapts to the user. For the former, when a user is in different situations (e.g., home, office, driving, playfield, etc.), he/she wants to use the phone in different modes, such as turning the ring volume louder or lower, declining all calls, etc. For the latter, the same site may need to change settings or services for different users. For example, a car shared by several users need to change seat position and radio channels when a user is coming. With the real-time user behavior prediction result, the proactive adaptation can be realized.
188
J. Song, E.Y. Tang, and L. Liu
5.3 Proactive Service Adaptation The above two usage models are beneficial to phone users only. This usage model is between user and service provider. If a user is willing to share his/her behavior pattern information to service providers, service providers can access the user behavior pattern analysis and prediction interface in the cloud to classify the user’s profile and predict the user’s activity, so that to push accurate commercial services to him/her. This is a win-win usage model. Users get timely individualized service, and service providers save costs on untargeted promotions.
6 Summary This paper proposed a MAST probabilistic model for using mobile phone sensors to analyze and predict user behavior patterns. Some concerned new technologies, like user activity detection, independent sensor system and phone-cloud collaboration, are discussed. From those preliminary studies, we find that this research direction is promising while challenging. In the future, we will further investigate those key technologies and complete a prototype system.
References [1] [2] [3] [4]
[5]
[6]
[7]
[8]
[9]
[10] [11]
Apple’s iPhone Information (April 6, 2010), http://www.apple.com/iphone/ Google’s Nexus One Information (April 6, 2010), http://www.google.com/phone Bryzek, J., et al.: Marvelous MEMS. IEEE Circuits and Devices Magazine 22, 8–28 (2006) Dou, Y., Yan, H., Lei, Z.: Broadband dial-up user behavior identification and analysis. In: Proceedings of the 2nd IEEE International Conference on Broadband Network & Multimedia Technology, pp. 316–322 (2009) Vilas, M., et al.: User behavior analysis of a video-on-demand service with a wide variety of subjects and lengths. In: Proceedings of the 31st EUROMICRO Conference on Software Engineering and Advanced Applications, pp. 330–337 (2005) Chung, Y.W., Chung, M.Y., Sung, D.K.: Modeling and Analysis of Mobile Terminal Power on/off-State Management Considering User Behavior. IEEE Transactions on Vehicular Technology 57, 3708–3722 (2008) Corbellini, S., Ferraris, F., Parvis, M.: A System for Monitoring Workers Safety in an Unhealthy Environment by means of Wearable Sensors. In: Proceedings of IEEE Instrumentation and Measurement Technology Conference, pp. 951–955 (2008) Cheng, J.: Testing and Debugging Persistent Computing Systems: A New Challenge in Ubiquitous. In: Proceedings of IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, pp. 408–414 (2008) Reddy, Y.V.: Pervasive Computing: Implications, Opportunities and Challenges for the Society. In: Proceedings of the 1st International Symposium on Pervasive Computing and Applications, pp. 5–5 (2006) Chong, C.-Y., Kumar, S.P.: Sensor Networks: Evolution, Opportunities, and Challenges. Proceedings of the IEEE 91, 1247–1256 (2003) Mirikitani, D.T., Nikolaev, N.: Recursive Bayesian Recurrent Neural Networks for Time-Series Modeling. IEEE Transactions on Neural Networks 21, 262–274 (2010)
User Behavior Pattern Analysis and Prediction Based on Mobile Phone Sensors
189
[12] Wu, F., Chiu, I.-H., Lin, J.-R.: Prediction of the intention of purchase of the user surfing on the Web using hidden Markov model. In: Proceedings of ICSSSM 2005, International Conference on Services Systems and Services Management, pp. 387–390 (2005) [13] Lu, T., et al.: A novel knowledge-based system for interpreting complex engineering drawings: theory, representation and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 1444–1457 (2009) [14] Lu, L., Zhang, H., Jiang, H.: Content Analysis for Audio Classification and Segmentation. IEEE Transactions on Speech and Audio Processing 10, 504–515 (2002) [15] Amazon Cloud Computing Service Information (April 6, 2010), http://aws.amazon.com/ec2/ [16] Amazon Cloud Storage Service Information (April 6, 2010), http://aws.amazon.com/s3/
ServiceStore: A Peer-to-Peer Framework for QoS-Aware Service Composition* Jun Jin1, Yu Zhang2, Yuanda Cao1, Xing Pu1, and Jiaxin Li1 1
Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing, China 2 School of Computer Science, Beijing University of Civil Engineering and Architecture, Beijing, China [email protected]
Abstract. Web service composition is to integrate component services for providing a value-added new service. With the growing number of component services and their dynamic nature, the centralized composition model can’t manage them efficiently and accurately. In this paper, we proposed a distributed hash table (DHT)-based peer-to-peer (P2P) service composition framework, called ServiceStore. Compared with the central control in centralized model, in our ServiceStore, service selection and composition are distributed to the involved task brokers, requesting nodes and service nodes. Furthermore, a simple parallel service selection approach which can still satisfy global constraints is proposed and implemented in our multi-role cooperation (MRC) protocol. The results of experimental evaluation show that ServiceStore can achieve high scalability and efficiency. Keywords: P2P; DHT; service selection; task broker; MRC protocol.
1 Introduction Standardized web service as a main solution of service-oriented computing provides a flexible and convenient way for applications to select and integrate basic services to form new value-added services. Many applications bring service composition into practice, Figure 1 shows a service composition example. In Figure 1, a traveler requests a tourism planning from a service provider and existing atomic services can’t satisfy this composite request by themselves. Service composer can integrate the fight booking, accommodation booking and car rental by using BPEL and execute the BPEL file on an engine such as BEPL4J[6]. Furthermore, the traveler also associates the request with some global QoS constraints (e.g. total price ≤ $5 and response time ≤ 1 minute). The service composer must ensure that the integrated QoS attributes satisfy the global constraints. *
This project supported by Provincial Natural Science Foundation of Hebei Province, China (Grant No. F2009000929).
ServiceStore: A Peer-to-Peer Framework for QoS-Aware Service Composition
191
Fig. 1. Web service composition example
With the aim of realizing efficient service composition and resource utilization in a distributed environment, the contributions of this paper can be briefly stated as follows: (1) A resource-efficient service selection approach. The goal is to pick out appropriate component services from the alternative ones that provide identical functionality but differ in QoS attributes and resource states. Based on the method proposed in [7], we design a simple local selection approach that not only satisfies the global constraints but also provides efficient resource utilization. (2) A multi-role cooperation (MRC) protocol. Each peer in the MRC protocol can play four roles – query originator, query decomposer, task broker and coordinator. With the help of this protocol, a composite service request can be resolved efficiently. Throughout the paper we assume that each component service is atomic. The rest of this paper is organized as follows. Section 2 gives a brief overview of related work. Section 3 introduces the system architecture. Our multi-role cooperation protocol for distributed service selection and composition is presented in Section 4. Section 5 gives a simple proactive failure recovery approach. Experimental evaluations are presented in Section 6. Finally, Section 7 gives conclusions and our future work.
2 Related Work
The problem of service composition has attracted many research institutes in recent years. As centralized orchestration lacks scalability and is prone to failure, more research work concentrates on decentralized orchestration. P2P systems, which are known for their self-organization and scalability, have been adopted by many projects as their basic architecture. SELF-SERV [4,5] adopts an orchestration model based on P2P interactions between the service components in the composition, which provides greater scalability than approaches based on a central scheduler. The authors propose the concept of a service community and a decentralized execution model, but service discovery and selection are not considered. WSPDS [7] uses an unstructured P2P system – Gnutella [2] – as its infrastructure. With a probabilistic flooding technique and the subsequent content-based network, the overhead of query dissemination is significantly reduced. In WSPDS, each servent (acting as both server and client) is composed of two engines, a communication engine and a local query engine, which are responsible for communication, collaboration and query issuing. To discover a requested service, each servent receives the query and forwards it to the neighbor whose identity is most similar to the query. However, QoS-aware service selection is not considered in WSPDS.
SpiderNet [10] proposes a QoS-aware service composition framework that uses a DHT-based P2P system as its infrastructure. Using (key, value) pairs, service discovery is very efficient. In SpiderNet, each peer acts the same as in WSPDS, as both server and client. SpiderNet uses a probing protocol to collect the needed information and perform parallel searching of multiple candidate service graphs. Although the authors use a probing budget and quota to control each request’s probing overhead, they do not consider the situation of a large number of concurrent requests, and the service session setup time cannot be guaranteed. Note that all the research work above treats service discovery and selection sequentially, which is costly and unnecessary. Integer programming [8] can be used to find an optimal selection of component services [15]. Alrifai et al. [9] adopt it and change service selection into a parallel fashion. In this paper, we adopt this idea and propose a distributed broker-based framework with the MRC protocol to achieve QoS-aware and resource-efficient service composition.
3 System Architecture
The ServiceStore system is implemented as a distributed middleware infrastructure, which can effectively map a user’s composite service request into a set of component services in the P2P service overlay.
3.1 Three-Layer Architecture
The architecture of ServiceStore is a three-layer structure (see Figure 2). The bottom service overlay is constructed by all service nodes with their registered component services and links mapped from the underlying physical network. To facilitate node location in the service overlay, a unique identifier nodeID is assigned to each service node. Each component service provides its functionality with advertised QoS attributes; however, these non-functional values change over time (e.g., the response time will be high when network congestion emerges). We classify the metadata of a component service into two parts: (1) static metadata, denoted as MD_s(s_ij), a profile of the component service including its function name, its location LC(s_ij) and its IO parameter list; (2) dynamic metadata, denoted as MD_d(s_ij), including recent statistical QoS attribute values Q(s_ij) = [q_1(s_ij),…,q_M1(s_ij)] and the instant workload WL(s_ij) = [r_1(s_ij),…,r_M2(s_ij)], where M1 and M2 are the sizes of the QoS vector and workload vector respectively; Q(s_ij) and WL(s_ij) vary over time and describe the performance of s_ij. All component services can be classified into different service classes, with each class sharing the same functionality. From the perspective of delivered functionality, each service class is identified with a service task, denoted as T_i. For clarity, we use S_i = {s_i1,…, s_ij,…, s_iL} to denote the service class corresponding to T_i, where s_ij represents the j-th component service able to fulfill the service task T_i, and L is the size of S_i. The case L > 1 indicates that the service task T_i can be realized by multiple candidate services, which may differ in their respective QoS attributes. To locate component services quickly, we adopt the Distributed Hash Table technique [13] to manage the component services in ServiceStore. DHT systems use (key, value) pairs to store and retrieve the value associated with a given key. We design a hash function to map a function name (keywords of the function name) to a nodeID
(GUID). After applying it, the metadata lists of component services that have similar function names and thus belong to the same service class (including MD_d(S_i) and MD_s(S_i)) are stored on the same service node, here called the task broker. We use Broker(S_i) to denote the task broker corresponding to S_i. In Figure 2, the middle layer shows the task brokers of all component services on the service overlay. The top layer is a function graph that comes from a user’s composite service request. A request is denoted as R = ⟨F, Q^r⟩, where F = {T_1,…,T_N} is a function graph composed of a set of tasks (composition relations), and Q^r = [q_1^r,…,q_M1^r] expresses the user’s QoS requirements.
Fig. 2. System architecture
3.2 QoS-Aware Service Composition Problem
In ServiceStore, given a composite service request R with F = {T_1,…,T_N} and Q^r = [q_1^r,…,q_M1^r], the aim of service composition is to find a list of component services, which we call an Execution Plan (EP), that can realize each task in F and satisfy each quality attribute in Q^r. Figure 2 shows a few selected component services and their links, marked with circles and dotted lines respectively, which make up an EP for R.
4 Multi-Role Cooperation (MRC) Protocol
4.1 Four Roles in the MRC Protocol
In the MRC protocol, each peer plays four roles: (1) The query originator provides a visualized specification environment [1] to help users issue composite service requests. (2) The query decomposer receives the quality levels, converts the global constraints Q^r into local constraints and sends them to the involved task brokers. (3) The task broker maintains the metadata list of alternative component services, receives local constraints from the requesting node and returns the most appropriate component services according to the heartbeats from every component service. (4) The execution coordinator coordinates the execution of the selected component services using composition patterns (e.g. sequential, conditional, concurrent, loop) and returns the results to the user.
Fig. 3. A peer plays four roles in ServiceStore
4.2 MRC Protocol
Our MRC protocol includes five major steps, shown in Figure 3:
Step 1. Issue a composite service request. With the help of prestored composite service templates, a user at a host specifies a composite service request R using a GUI. After query analysis, ⟨F, Q^r⟩ is sent to the query decomposer and F is sent to the execution coordinator.
Step 2. Decompose global QoS constraints into local ones. Through graph analysis, the involved tasks and the global QoS constraints Q^r are fed into the MIP decomposition [7]. The query decomposer then obtains N local constraints and sends them to the corresponding task brokers.
Step 3. Select feasible component services locally. To achieve accurate selection, each component service sends a heartbeat message with MD_d(s_ij), indicating the state of s_ij, to Broker(S_i). Each corresponding task broker then performs local selection and separately returns the selected service candidates to the requesting peer. For efficient resource utilization, the task broker also updates WL(s_ij) of the optimal selected candidate service (e.g., its concurrent connection count is increased by one). The details of this step are described in the next subsection.
Step 4. Form executable plans. Upon receipt of all service candidates from the corresponding task brokers, the execution coordinator composes them into an EP according to F and sends it to the first component service in the EP.
Step 5. Coordinate the execution of the EP. When a component service receives an EP, it checks whether its function is contained in the EP. If so, the component service executes and outputs its results to the next component service according to the EP. Finally, after all component services complete their executions, the
last component service sends the results to the requesting peer (user), and each involved task broker restores WL(s_ij) of the component services in the EP.
4.3 Service Selection
Since each task broker keeps the information of all alternative component services, upon receipt of the local constraints it uses them as bounds and performs service selection for the corresponding service class independently. The following parameters are given: (1) the received local constraints for service class S_i: Q^c(S_i) = [q_k(S_i) | 1 ≤ k ≤ M1]; (2) the dynamic metadata of component service s_ij: MD_d(s_ij) = ⟨Q(s_ij), WL(s_ij)⟩, where Q(s_ij) = [q_k(s_ij) | 1 ≤ k ≤ M1] and WL(s_ij) = [r_k(s_ij) | 1 ≤ k ≤ M2]. We compute the utility U(s_ij) of the j-th service candidate in class S_i as

U(s_{ij}) = \sum_{k=1}^{M1} \frac{Q_{max}(i,k) − q_k(s_{ij})}{Q_{max}(i,k) − Q_{min}(i,k)} × ω_k    (1)

where \sum_{k=1}^{M1} ω_k = 1, and Q_{max}(i,k) and Q_{min}(i,k) represent the maximum and minimum values of the k-th quality attribute in class S_i. Generally, the component service with the highest U(s_ij) is always selected, as it provides the best capability. However, as the number of invocations increases, its actual performance may become poor. Thus, a resource utility function UR(s_ij) is needed to represent a component service’s resource utility:

UR(s_{ij}) = \sum_{k=1}^{M2} \frac{r_a^{v_{ij}}(k)}{r_{max}^{v_{ij}}(k)} × ω_k    (2)

where \sum_{k=1}^{M2} ω_k = 1, and r_{max}^{v_{ij}}(k) and r_a^{v_{ij}}(k) represent the maximum and available amounts of the k-th resource (e.g. memory) on node v_ij (supposing s_ij runs on node v_ij). Three requirements need to be considered in service selection: the selected component services should (1) satisfy the global constraints; (2) achieve a large resource utility value; and (3) achieve the optimal utility. The first requirement is essential, as our major aim is to achieve a feasible solution for the user. The last requirement is given the lowest priority: even though a high utility value pleases the user, it may make the system unstable because of resource competition. Hence, for better stability and resource utilization, the second requirement has higher priority than the last one. We apply these requirements in Algorithm 1.
Algorithm 1. Local service selection
Input: decomposed local constraints Q^c = [q_1, q_2, …, q_M1] and the metadata list of the service class S, MD_d(S) = {⟨[q_1(s_j),…, q_M1(s_j)], [r_1(s_j), …, r_M2(s_j)]⟩ | 1 ≤ j ≤ L}
Output: a sorted list S_out of feasible component services
1. for each candidate s_j in S do
2.   set c = true and k = 1
3.   while k ≤ M1 do
4.     if q_k(s_j) > q_k then set c = false and break
5.     end if
6.     set k = k + 1
7.   end while
8.   if c then
9.     add s_j to S_out
10.    compute U(s_j) and UR(s_j)
11.  end if
12. end for
13. sort S_out according to UR(s_j)
14. return S_out
Our aim is to get a list of feasible component services from each involved task broker for the completion of the EP. All feasible component services must meet the aforementioned three requirements. Algorithm 1 shows the service selection process. With the decomposed QoS constraints for service class S and the metadata of S as input parameters, Broker(S) runs this algorithm. Broker(S) checks every QoS constraint (e.g. price ≤ $2) for every candidate service. If any QoS attribute is beyond the upper bound of the corresponding QoS constraint, Broker(S) discards that component service. Therefore, the first requirement is satisfied during service selection. After S_out is filled with all feasible component services, it is sorted according to the UR(s_j) of every feasible component service. For fault tolerance, each involved task broker returns more than one candidate service.
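To make the broker-side logic concrete, the following Python sketch illustrates one possible implementation of Algorithm 1 together with the utility functions of Eq. (1) and Eq. (2). It is only an illustration of the idea, not the authors’ implementation: the candidate record format, the example weights and the example constraint values are our own assumptions.

    # Illustrative sketch of Algorithm 1 and Eqs. (1)-(2); the data layout and
    # weights below are assumed for the example, not taken from ServiceStore.
    def qos_utility(qos, q_min, q_max, weights):
        # Eq. (1): distance from the worst observed value, normalized and weighted.
        u = 0.0
        for k, q in enumerate(qos):
            span = q_max[k] - q_min[k]
            u += ((q_max[k] - q) / span if span > 0 else 1.0) * weights[k]
        return u

    def resource_utility(avail, maximum, weights):
        # Eq. (2): weighted fraction of each resource still available on the node.
        return sum((a / m) * w for a, m, w in zip(avail, maximum, weights))

    def local_selection(candidates, local_constraints, w_qos, w_res):
        # candidates: list of dicts with 'id', 'qos', 'avail' and 'max' entries.
        m1 = len(local_constraints)
        q_min = [min(c['qos'][k] for c in candidates) for k in range(m1)]
        q_max = [max(c['qos'][k] for c in candidates) for k in range(m1)]
        s_out = []
        for c in candidates:
            # Requirement 1: drop candidates that violate any local QoS bound.
            if any(q > bound for q, bound in zip(c['qos'], local_constraints)):
                continue
            s_out.append((c['id'],
                          qos_utility(c['qos'], q_min, q_max, w_qos),
                          resource_utility(c['avail'], c['max'], w_res)))
        # Requirement 2 outranks requirement 3: sort by UR first, then by U.
        s_out.sort(key=lambda t: (t[2], t[1]), reverse=True)
        return s_out

    if __name__ == '__main__':
        view = [  # QoS vector: [price, response time]; resources: [memory, CPU share]
            {'id': 's1', 'qos': [1.5, 40.0], 'avail': [256, 0.4], 'max': [512, 1.0]},
            {'id': 's2', 'qos': [1.0, 70.0], 'avail': [128, 0.9], 'max': [512, 1.0]},
            {'id': 's3', 'qos': [2.5, 30.0], 'avail': [400, 0.2], 'max': [512, 1.0]},
        ]
        # Local constraints: price <= 2.0, response time <= 80.0 (assumed units).
        print(local_selection(view, [2.0, 80.0], [0.5, 0.5], [0.5, 0.5]))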
5 Proactive Failure Recovery
Failure recovery is essential in a dynamic environment [12]. ServiceStore provides a proactive failure recovery mechanism to maintain the quality of service composition at runtime. As the task broker is a crucial role in the MRC protocol, we discuss the situation when task brokers fail. ServiceStore maintains a small number of backup task brokers for each service class: for fault tolerance, when a peer publishes a component service, the metadata of this component service are stored on more than one task broker [10]¹.
(1) Backup task broker computation. By applying the secure hash algorithm to the strings formed by concatenating the component service’s function name two or three times, we obtain different resourceIds, and thus different task brokers to maintain the metadata list of this service class. For clarity, we define the following simple functions:
String Concatenate(String functionname, int n): concatenate the function name n times.
GUID SH(String functionname): apply the secure hash algorithm to the function name.
The three task brokers of the service class S_i with function name name_i are then calculated as follows:
Broker1(S_i) = SH(Concatenate(name_i, 1))
Broker2(S_i) = SH(Concatenate(name_i, 2))
Broker3(S_i) = SH(Concatenate(name_i, 3))
¹ The current implementation keeps two additional backups.
For example, the three task brokers of the car rental service class are calculated by SH(Concatenate(“carrental”, 1)), SH(Concatenate(“carrental”, 2)) and SH(Concatenate(“carrental”, 3)), respectively.
(2) Backup task broker selection. If one of the task brokers fails, we adopt the rule depicted in Figure 4.
(3) Backup task broker synchronization. As service selection depends on the metadata list of the candidate services, each component service periodically sends its MD_d to the task brokers (including the backup ones).
Fig. 4. A simple rule for task broker selection
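As a minimal sketch of how the broker identifiers described in part (1) above can be computed, the Python fragment below uses SHA-1 as a stand-in for the secure hash; the hexadecimal GUID format is our assumption, since in the actual system the digest is mapped into Pastry’s identifier space.

    # Sketch of backup task broker computation; SHA-1 and hex GUIDs are assumptions.
    import hashlib

    def concatenate(function_name, n):
        # Concatenate(name, n): the function name repeated n times.
        return function_name * n

    def sh(text):
        # SH(text): secure hash of the string, used here as the broker GUID.
        return hashlib.sha1(text.encode('utf-8')).hexdigest()

    def broker_guids(function_name, replicas=3):
        # Broker_k(S_i) = SH(Concatenate(name_i, k)), k = 1..replicas.
        return [sh(concatenate(function_name, k)) for k in range(1, replicas + 1)]

    if __name__ == '__main__':
        for k, guid in enumerate(broker_guids('carrental'), start=1):
            print('Broker%d(carrental) = %s' % (k, guid))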
6 Implementation and Evaluation
The experiments are carried out on PeerSim [3], and the decentralized service overlay is implemented on top of the DHT-based P2P system Pastry [11,14]. Note that we set the same parameters for each round of simulation: 50 service functions in the service overlay, 2 composite service requests per time unit, and each round of simulation lasting 3000 time units. First, we evaluate the resource utilization of our MRC protocol. For simplicity, we use the number of concurrent links of each candidate component service to measure the resource efficiency of our selection approach. We use a 1000-node service overlay, where each node provides component services whose functions are selected from 50 predefined functions, and each function has 4 duplicates. Each composite service request contains 3 functions. Every component service’s number of concurrent links is increased by one when it is selected and reduced by one after working for 80-90 time units. Figure 5 illustrates that the 4 duplicates have almost the same number of concurrent connections over 2000 time units.
Fig. 5. Concurrent connections of each service duplicate
Second, we evaluate the maximum and average numbers of hops needed to discover a component service in service overlays of different sizes. Since the expected number of routing hops in the DHT-based P2P system Pastry is O(log N) [14], Figure 6 shows that although the service overlay size grows from 500 to 20000, the average hop count increases very slowly and only a few lookups reach the maximum number.
Fig. 6. Max and average hops for discovering a component service
Third, we measure the average composite service session setup time with different service overlay sizes and different numbers of functions per request. Figure 7 illustrates the average composite service session setup time when the function number is 3, 5, 10 or 20 and the service overlay size varies from 50 to 20000. Thanks to parallel service selection, as the service overlay size and the function number in each request increase, the composite service setup time increases slowly and does not multiply with the function number.
Fig. 7. Composite service session setup time
7 Conclusion and Future Work
In this paper, we have presented a P2P service composition framework called ServiceStore. The main contributions are: 1) ServiceStore provides a fully decentralized architecture implemented by using distributed task brokers as coordinators; 2)
ServiceStore provides a simple MRC protocol for service composition; 3) our evaluation shows that ServiceStore scales well as the service overlay size increases and achieves good resource efficiency. Since the task broker is a critical role in the MRC protocol and failure recovery needs more time, especially near the end of composite service execution, in the future we will integrate behavior prediction into our service composition framework, which can help us improve system stability.
References
1. Frolund, S., Koistinen, J.: Quality of service specification in distributed object systems design. Distributed Systems Engineering Journal 5(4) (December 1998) (QML, which is a language for the description of QoS using XML)
2. Gnutella, http://gnutella.wego.com/
3. PeerSim, http://peersim.sourceforge.net/
4. Benatallah, B., Dumas, M., Sheng, Q.Z., Ngu, A.H.: Declarative Composition and Peer-to-Peer Provisioning of Dynamic Web Services. In: Proc. Int’l Conf. Data Eng. (ICDE), pp. 297–308 (February 2002)
5. Benatallah, B., Sheng, Q., Dumas, M.: The Self-Serv environment for web services composition. IEEE Internet Computing 7(1), 40–48 (2003)
6. Business Process Execution Language for Web Services Java Run Time (BPWS4J), http://www.alphaworks.ibm.com/tech/bpws4j
7. Banaei-Kashani, F., Chen, C.-C., Shahabi, C.: WSPDS: Web services peer-to-peer discovery service. In: Proc. of the 5th Int’l Conference on Internet Computing (IC), Las Vegas, Nevada, pp. 733–743 (June 2004)
8. Nemhauser, G.L., Wolsey, L.A.: Integer and Combinatorial Optimization. Wiley-Interscience, New York (1988)
9. Alrifai, M., Risse, T.: Combining Global Optimization with Local Selection for Efficient QoS-aware Service Composition. In: Proc. of the 18th Int’l World Wide Web Conference (WWW), Madrid, Spain (April 2009)
10. Gu, X., Nahrstedt, K., Yu, B.: SpiderNet: An Integrated Peer-to-Peer Service Composition Framework. In: Proc. of the 13th Int’l Symp. on High-Performance Distributed Computing (HPDC), Honolulu, Hawaii, pp. 110–119. IEEE Computer Society, Los Alamitos (2004)
11. PeerSim-Pastry, http://code.google.com/p/peersim-pastry/
12. Raman, B., Katz, R.H.: Load Balancing and Stability Issues in Algorithms for Service Composition. In: Proc. of IEEE INFOCOM 2003, San Francisco, CA (April 2003)
13. Distributed hash table, http://en.wikipedia.org/wiki/Distributed_hash_table
14. Rowstron, A., Druschel, P.: Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In: Proc. of the 18th IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), Heidelberg, Germany (November 2001)
15. Ardagna, D., Pernici, B.: Global and Local QoS Constraints Guarantee in Web Service Selection. In: 3rd IEEE International Conference on Web Services (ICWS), Orlando, FL, USA, pp. 805–806 (July 2005)
Identifying Nearest Neighbor Nodes and Connectivity in Three-Dimensional Wireless Sensor Networks Using Poisson Point Field Yanhuai Qu1 , Jianan Fang1 , and Shuai Zhang2 1 School of Information Science and Technology, Donghua University, Shanghai 201620, P.R. China 2 School of Information and Electronic Engineering, Shandong Institute of Business and Technology, Yantai 264005, P.R. China [email protected], [email protected], [email protected]
Abstract. Problems concerning the connectivity of wireless sensor networks (WSNs) remain important and difficult research topics, especially the trade-off between connectivity and energy control. In this paper, we present a novel and effective method to identify nearest neighbor nodes in three-dimensional WSNs using Poisson point field theory, which enables each node to find its k-th nearest neighbor node and adjust its transmitting range according to local requirements. Moreover, we derive the probability expression of a k-connected network. Finally, we give corresponding simulation results and discuss the practical value for node energy control and the design of WSNs. Keywords: Wireless sensor networks, nearest neighbor nodes, connectivity, transmitting range, Poisson point field.
1 Introduction
Wireless Sensor Networks (WSNs) have received significant attention in recent years due to their numerous potential military and civil applications. Typically, a wireless sensor node consists of sensing, computing, communication, actuation, and power components. Sensor nodes are often air-dropped into hostile or dangerous regions. It is not possible for humans to reach and maintain each sensor node, as the number of sensor nodes is often quite large. Hence, self-organization of sensor nodes to form a connected network is an essential requirement. Additionally, the unattended nature of WSNs means that a majority of sensors have energy sources which may not be replenished. Though some WSNs are equipped with renewable energy sources such as solar batteries, energy consumption still limits their application. Energy is therefore a decisive resource in WSNs. Energy efficiency and network lifetime have occupied a large portion of the research effort on WSNs [1],[2],[3],[4],[5]. In this paper, we investigate the problems of nearest neighbor nodes and connectivity of WSNs in three dimensions applying Poisson point field theory
and random geometric graph theory. We assume n nodes (hereafter we refer to a sensor node simply as a node) are distributed in [0, l]^3 according to a three-dimensional homogeneous Poisson point field with density λ (λ = n/l^3). First, we investigate the nearest neighbor nodes of a node and make use of conditional probability to find the probability distribution function F(r) of the k-th nearest neighbor node distance; the corresponding critical transmitting range of the node is then given. Next, on the basis of these results, we discuss the connectivity of WSNs applying random geometric graph theory and derive the probability expression of the k-connected network and the corresponding critical transmitting range of all nodes. Moreover, we present simulation results and discuss the practical value for node power control and the design of WSNs. Finally, we present conclusions and outline the directions for our future work.
2 Related Prior Work
In this section, we recall some related work. Hou and Li [6] presented a model for analyzing the performance of transmission strategies in a multihop packet radio network where each station had an adjustable transmission radius. One of the transmission strategies, namely transmitting to the nearest forward neighbor by using adjustable transmission power, had desirable features in a high terminal density environment. Philips et al. [7] examined a model of a packet radio network in which transmitters with range R were distributed according to a two-dimensional Poisson point process with density D. To ensure network connectivity, the expected number of nearest neighbors of a transmitter must grow logarithmically with the area of the network. Gupta and Kumar [8] derived the critical power a node in a wireless network needs to transmit in order to ensure that the network is connected with probability one as the number of nodes tends to infinity. They showed that if n nodes are distributed in a disc of unit area and each node transmits at a power level so as to cover an area of πr^2 = (ln n + c(n))/n, then the resulting network is asymptotically connected with probability one if and only if c(n) → +∞. Bettstetter [9],[10] investigated the minimum node degree and k-connectivity of wireless multihop networks in one- (respectively, two-) dimensional space and derived an analytical expression for the transmitting range of a k-connected network.
3 Preliminary
3.1 Poisson Point Field
The one-dimensional number distribution of a stationary Poisson point field N is

Prob(N(B) = k) = \frac{(λ|B|)^k}{k!} e^{−λ|B|},  k = 0, 1, …;  |B| < ∞,    (1)

where N(B) denotes the number of points of N in a bounded Borel set B, λ is the density of N, and |·| denotes Lebesgue measure [11],[12],[13],[14].
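As a quick, informal check of Eq. (1), one can scatter n uniform points in [0, l]^3 (which approximates a homogeneous Poisson point field of density λ = n/l^3 when n is large) and compare the empirical count distribution in a sub-region B with the formula. The Python sketch below is illustrative only; the parameter values are arbitrary examples.

    # Illustrative check of Eq. (1); parameter values are arbitrary examples.
    import math, random

    def poisson_pmf(k, mean):
        # Prob(N(B) = k) = (lambda*|B|)^k / k! * exp(-lambda*|B|)
        return mean ** k / math.factorial(k) * math.exp(-mean)

    def empirical_counts(n, l, box, trials=5000):
        # box = ((x0, x1), (y0, y1), (z0, z1)) is a sub-region of [0, l]^3.
        hist = {}
        for _ in range(trials):
            c = 0
            for _ in range(n):
                p = (random.uniform(0, l), random.uniform(0, l), random.uniform(0, l))
                if all(lo <= v <= hi for v, (lo, hi) in zip(p, box)):
                    c += 1
            hist[c] = hist.get(c, 0) + 1
        return {count: freq / trials for count, freq in hist.items()}

    if __name__ == '__main__':
        n, l, side = 100, 125.0, 40.0
        box = ((0.0, side),) * 3
        mean = (n / l ** 3) * side ** 3          # lambda * |B|
        emp = empirical_counts(n, l, box)
        for k in range(6):
            print(k, round(emp.get(k, 0.0), 4), round(poisson_pmf(k, mean), 4))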
3.2 Some Basic Concepts of Graph Theory
Definition 1 (Graph). A graph G is an ordered pair of disjoint sets (V, E), where E ⊆ V × V. The set V is called the vertex (or node, or point) set, while the set E is the edge (or line) set of graph G.
Definition 2 (Connected Graph). A graph G = (V, E) is connected if for any two nodes u, v ∈ V there exists a path from u to v in G.
Definition 3 (k-Connected and k-Edge-Connected Graph). A graph G = (V, E) is k-connected, for some k ≥ 2, if removing any k−1 nodes from the graph does not disconnect it. Similarly, G is k-edge-connected, for some k ≥ 2, if removing any k−1 edges from the graph does not disconnect it [15] (see Fig. 1).
Fig. 1. 2-connected Graph
3.3 Elements of Random Geometric Graphs Theory
Definition 4 (Random Geometric Graphs). In the random geometric graph G(n, r), n nodes are placed at random with uniform distribution in a square area of unit size (more generally, an m-dimensional cube). There is an edge (u, v) between any pair of nodes u and v if the Euclidean distance between them is less than r [16].
4 Network Model
4.1 Network Topology Model
Consider a WSN. Given n ∈ N+ (N+ = {1, 2, ...}), we use V to denote the set of n nodes, i.e. V = {V_i : i ∈ [n]}. Given l ∈ N+, we use L to denote the set of l communication links between the nodes in V, i.e. L = {L_i : i ∈ [l]}.
Definition 5 (Network Topology Graph). The network topology graph NTG = ⟨V, L⟩ is a graph with vertex set V representing the set of nodes and edge set L representing the set of communication links.
An arbitrary WSN consists of a set of nodes randomly placed in a space. Each node can communicate with another node within distance r through one-hop communication, where r is the transmitting range of the node (see Fig. 2). Throughout the paper, we consider an independent and identically distributed (i.i.d.) network topology model.
Fig. 2. The topology of WSNs
4.2 Spatial Node Distribution Model
Assume n nodes are distributed in [0, l]^3 according to a three-dimensional homogeneous Poisson point field with density λ (λ = n/l^3).
5 Nearest Neighbor Nodes
For a Poisson point field, we can specify how the inter-node distances are distributed. By the stationarity of the Poisson point field and Slivnyak’s theorem [11],[12], the distance between the origin and its k-th nearest neighbor has the same distribution as the distance between an arbitrary point and its k-th nearest neighbor.
5.1 The k-th Nearest Neighbor Node
In this section, we give the probability distribution function F(r) of the k-th nearest neighbor node distance, denoted by F_k(r). According to the homogeneity of the Poisson point field, we can calculate it for a point close to the origin o. The conditional probability

F_k(r) = 1 − Prob((N(B(o,r) − B(o,ε)) = 0) ∪ (N(B(o,r) − B(o,ε)) = 1) ∪ … ∪ (N(B(o,r) − B(o,ε)) = k−1) | N(B(o,ε)) = 1)    (2)

is the probability that the k-th neighbor node of a node in a small sphere B(o, ε) (ε > 0) is located at a distance not greater than r from the origin o (see Fig. 3), where r ≫ ε. As ε tends to 0, we can derive the probability distribution function F_k(r) of the k-th nearest neighbor node distance. According to the definition of conditional probability, we have

F_k(r) = 1 − \frac{Prob((N(B(o,r) − B(o,ε)) = 0) ∪ (N(B(o,r) − B(o,ε)) = 1) ∪ … ∪ (N(B(o,r) − B(o,ε)) = k−1), N(B(o,ε)) = 1)}{Prob(N(B(o,ε)) = 1)}.    (3)
Fig. 3. Nearest neighbor nodes in three dimensions
Because B(o,r) − B(o,ε) and B(o,ε) are disjoint Borel sets, by the fundamental properties of the homogeneous Poisson point field we get

F_k(r) = 1 − Prob((N(B(o,r) − B(o,ε)) = 0) ∪ (N(B(o,r) − B(o,ε)) = 1) ∪ … ∪ (N(B(o,r) − B(o,ε)) = k−1)).    (4)

Making use of (1), we can derive

F_k(r) = 1 − Exp(−λ|B(o,r) − B(o,ε)|) (1 + λ|B(o,r) − B(o,ε)| + … + \frac{(λ|B(o,r) − B(o,ε)|)^{k−1}}{(k−1)!}),    (5)

F_k(r) = 1 − Exp(−\frac{4}{3}λπ(r^3 − ε^3)) · (1 + \frac{4}{3}λπ(r^3 − ε^3) + … + \frac{(\frac{4}{3}λπ(r^3 − ε^3))^{k−1}}{(k−1)!}).    (6)

Letting ε tend to 0, we then have

F_k(r) = 1 − e^{−\frac{4}{3}λπr^3} (1 + \frac{4}{3}λπr^3 + … + \frac{(\frac{4}{3}λπr^3)^{k−1}}{(k−1)!}) = 1 − e^{−\frac{4}{3}λπr^3} \sum_{i=0}^{k−1} \frac{(\frac{4}{3}λπr^3)^i}{i!}.    (7)

The corresponding probability density function is

f_k(r) = \frac{3(\frac{4}{3}λπr^3)^k}{r(k−1)!} e^{−\frac{4}{3}λπr^3}.    (8)
We denote by R_{k-th} the critical transmitting range of the node required to reach its k-th nearest neighbor node. We can deduce the probability

Prob(r ≤ R_{k-th}) = \int_0^{R_{k-th}} f_k(r) dr = \int_0^{R_{k-th}} \frac{3(\frac{4}{3}λπr^3)^k}{r(k−1)!} e^{−\frac{4}{3}λπr^3} dr = 1 − e^{−\frac{4}{3}λπR_{k-th}^3} \sum_{i=0}^{k−1} \frac{(\frac{4}{3}λπR_{k-th}^3)^i}{i!}.    (9)

5.2 The Isolated Node
In particular, when a node has no neighbor node, i.e. it is an isolated node, the probability is

Prob(isolated node) = Prob(N(B(o,r) − B(o,ε)) = 0) = Exp(−λ|B(o,r) − B(o,ε)|) = Exp(−\frac{4}{3}λπ(r^3 − ε^3)) = e^{−\frac{4}{3}λπr^3}  (ε → 0),    (10)

and the critical transmitting range R_{iso} of the node is

R_{iso} = \sqrt[3]{−\frac{3 ln Prob(isolated node)}{4λπ}}.    (11)
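The closed-form expressions (7)-(11) can be evaluated directly; the short Python sketch below (an illustration with arbitrary example parameters, not part of the paper) computes F_k(r), the isolated-node range R_iso of Eq. (11), and the critical range of the k-th nearest neighbor node by numerically inverting F_k as in Eq. (9).

    # Evaluating Eqs. (7), (9) and (11); the density and the 99% target are
    # example values chosen to mirror the simulation setup, not prescribed here.
    import math

    def F_k(r, lam, k):
        # Eq. (7): probability that the k-th nearest neighbor is within distance r.
        mu = 4.0 / 3.0 * lam * math.pi * r ** 3
        return 1.0 - math.exp(-mu) * sum(mu ** i / math.factorial(i) for i in range(k))

    def r_iso(lam, p_isolated):
        # Eq. (11): transmitting range at which a node is isolated with probability p.
        return (-3.0 * math.log(p_isolated) / (4.0 * lam * math.pi)) ** (1.0 / 3.0)

    def r_kth(lam, k, target=0.99, hi=500.0):
        # Smallest r with F_k(r) >= target, by bisection (F_k is monotone in r).
        lo = 0.0
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if F_k(mid, lam, k) >= target:
                hi = mid
            else:
                lo = mid
        return hi

    if __name__ == '__main__':
        lam = 100 / 125.0 ** 3                   # 100 nodes in a 125 m cube
        print('R_iso with P(isolated) = 0.01:', round(r_iso(lam, 0.01), 2), 'm')
        print('R_3rd with F_3(r) = 0.99:', round(r_kth(lam, 3), 2), 'm')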
6 Connectivity
In this section, we investigate the k-connected network. In [17], Penrose proved that

Prob(G is k-connected) = Prob(DEG_{min}(G) ≥ k),    (12)

where DEG_{min}(G) denotes the minimum node degree of the random geometric graph G; i.e., a network is k-connected if each node can reach at least its k-th nearest neighbor node. So the probability of a k-connected network is

Prob(k-connected) ≈ (1 − e^{−\frac{4}{3}λπR_{kConn}^3} \sum_{i=0}^{k−1} \frac{(\frac{4}{3}λπR_{kConn}^3)^i}{i!})^n,    (13)

where R_{kConn} denotes the critical transmitting range of all nodes of the k-connected network.
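Correspondingly, Eq. (13) can be evaluated and inverted numerically to obtain R_kConn for a given target probability. The sketch below is again only an illustration: the network parameters mirror the simulation setup of the next section, and the 99% target is our assumption.

    # Evaluating Eq. (13) and searching for R_kConn; the target probability is assumed.
    import math

    def prob_k_connected(r, lam, k, n):
        # Eq. (13): P(k-connected) ~ (P(a node has degree >= k))^n.
        mu = 4.0 / 3.0 * lam * math.pi * r ** 3
        per_node = 1.0 - math.exp(-mu) * sum(mu ** i / math.factorial(i) for i in range(k))
        return per_node ** n

    def critical_range(lam, k, n, target=0.99, hi=500.0):
        # Smallest r with P(k-connected) >= target, by bisection on the monotone curve.
        lo = 0.0
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if prob_k_connected(mid, lam, k, n) >= target:
                hi = mid
            else:
                lo = mid
        return hi

    if __name__ == '__main__':
        n, l = 100, 125.0
        lam = n / l ** 3
        for k in (1, 2, 3):
            print('R_%dConn yielding 99%%: %.2f m' % (k, critical_range(lam, k, n)))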
7 Simulation and Discussion
In this section, we present our simulations and discussion. In our simulation environment, the node distribution space is V = [0, 125]^3 (unit of measurement: m^3) and the number of nodes is n ∈ [0, 100], so the node density is λ = n/V. First, Fig. 4 shows simulation results for the probability that a node has no neighbor node. For a WSN to function properly, in the first place it must be connected, or mostly connected; otherwise the network would consist of scattered isolated islands of information and could not properly support networking applications.
Fig. 4. (a) The probability that a node has no neighbor node, (b) Relation between the number of nodes and transmission range when probability that a node has no neighbor node is 99%
Next, Fig. 5 shows the simulation results for the probability distribution that a node has a 3rd nearest neighbor node. A main constraint in WSNs is energy, and it contributes to the survival of the overall network if each node is capable of finding its nearest neighbor nodes and adjusting its transmitting range according to local requirements. Then, Fig. 6 shows simulation results for the probability distribution of a 3-connected network. Connectivity is a vital attribute of WSNs. Through the simulations, we notice that the network rapidly becomes multiply connected as the transmitting range increases. Finally, the critical transmitting range values reported in Table 1 can be interpreted as the transmitting ranges for different operation states of a WSN, where R_iso denotes the critical transmitting range of an isolated node, R_1st (respectively, R_2nd and R_3rd) denotes the critical transmitting range of a node having a 1st (respectively, 2nd and 3rd) nearest neighbor node, and R_1Conn (respectively, R_2Conn and R_3Conn)
Fig. 5. (a) The probability distribution that a node has the 3rd nearest neighbor node, (b) Relation between the number of nodes and transmitting range when probability that a node has the 3rd nearest neighbor node is 99%
Fig. 6. (a) The probability distribution of 3-connected network, (b) Relation between the number of nodes and transmitting range when probability of the 3-connected network is 99%
denotes the critical transmitting range of all nodes of a 1-connected (respectively, 2-connected and 3-connected) network. We observe that the critical transmitting range values differ significantly between having the k-th nearest neighbor node and being k-connected. Since sensor transmitting power is in direct proportion to the transmitting range, the above results can provide an idea for designers of WSNs: under the condition of a large connected component, local nodes or a few isolated nodes can adjust their transmitting ranges independently of any other node, which can maximally prolong the lifetime of the network.
Table 1. Critical transmitting range (unit of measurement: m) yielding 99% probability for different operation states of WSNs, varying with network capacity n (n = 10, 25, 50, 75, 100)
8 Conclusion and Future Work
In this paper, we propose a novel method to investigate the problems of nearest neighbor nodes and connectivity of WSNs in three dimensions. We first provide the probability distribution function of the k-th nearest neighbor node distance and an analytical expression for the critical transmitting range of the corresponding node. We then derive the probability expression of a k-connected network and the critical transmitting range of all nodes. Simulation results and discussions demonstrate that the method is accurate and effective and can provide a theoretical basis for further research on energy efficiency and topology control of WSNs. Our future work will focus on inhomogeneous WSNs and their stochastic properties.
Acknowledgments. This research was supported in part by the National Natural Science Foundation of China under Grant No. 60874113. The authors also thank the reviewers for their valuable comments.
References
1. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: Wireless sensor networks: A survey. Computer Networks 38(2), 393–422 (2002)
2. Chakrabarty, K., Iyengar, S.S.: Scalable Infrastructure for Distributed Sensor Networks. Springer, London (2005)
3. Pottie, G.J., Kaiser, W.J.: Wireless sensor networks. Communications of the ACM 43, 51–58 (2000)
4. Brooks, R.R., Iyengar, S.S.: Multi-Sensor Fusion: Fundamentals and Applications with Software. Prentice Hall, Englewood Cliffs (1998)
5. Iyengar, S.S., Brooks, R.R.: Distributed Sensor Networks. Chapman & Hall/CRC, Boca Raton (2005)
6. Hou, T.C., Li, V.O.K.: Transmission range control in multihop packet radio networks. IEEE Trans. on Communications COM-34, 38–44 (1986)
7. Philips, T.K., Panwar, S.S., Tantawi, A.N.: Connectivity properties of a packet radio network model. IEEE Trans. on Information Theory 35(5), 1044–1047 (1989)
8. Gupta, P., Kumar, P.R.: Critical power for asymptotic connectivity in wireless networks. In: Stochastic Analysis, Control, Optimization and Applications. Birkhäuser, Boston (1998)
9. Bettstetter, C.: On the minimum node degree and connectivity of a wireless multihop network. In: Proc. ACM International Symposium on Mobile Ad Hoc Networking and Computing. ACM Press, New York (2002)
10. Bettstetter, C.: On the connectivity of wireless multihop networks with homogeneous and inhomogeneous range assignment. In: Proc. IEEE Vehicular Technology Conference. IEEE Press, Los Alamitos (2002)
11. Stoyan, D., Kendall, W., Mecke, J.: Stochastic Geometry and Its Applications, 2nd edn. John Wiley & Sons, Chichester (1996)
12. Stoyan, D., Stoyan, H.: Fractals, Random Shapes and Point Fields. John Wiley & Sons, Chichester (1994)
13. Daley, D.J., Vere-Jones, D.: An Introduction to the Theory of Point Processes, 2nd edn. Elementary Theory and Methods, vol. I. Springer, New York (2003)
14. Daley, D.J., Vere-Jones, D.: An Introduction to the Theory of Point Processes, 2nd edn. General Theory and Structure, vol. II. Springer, New York (2008)
15. Bollobás, B.: Modern Graph Theory. Springer, New York (1998)
16. Penrose, M.D.: Random Geometric Graphs. Oxford University Press, Oxford (2003)
17. Penrose, M.D.: On k-connectivity for a geometric random graph. Random Structures and Algorithms 15(2), 145–164 (1999)
A Novel Trust Evaluation Model for Mobile P2P Networks Xu Wu Department of Computer Science, Xi’an Institute of Posts and Telecommunications, Xi’an, 710121, China [email protected]
Abstract. Trust is one of the key factors that influence the development of mobile P2P networks. However, current trust evaluation models cannot be applied to mobile P2P networks properly due to some of their characteristics, such as the heterogeneous nature of the peers and the limited range and unreliability of wireless links. In this paper, the main factors that influence trust in mobile P2P networks are identified. Based on the analysis, we propose a novel trust evaluation model, which helps the networks operate normally with high probability. Our model does not employ cryptographic approaches or certification mechanisms, so it is light enough to fit mobile P2P networks without great overhead. In addition, it effectively solves the trust problem when no prior interaction history exists, an issue that has not been addressed in many models. The proposed model is shown through simulation to be robust in the presence of attackers. Keywords: Model; trust; P2P networks; mobility.
However, mobile P2P networks pose some unique challenges, and many trust evaluation models [2-6] cannot be applied to them properly. In this paper, the characteristics of mobile P2P networks are discussed, and the main factors that influence trust in mobile P2P networks are analyzed. Based on the analysis, we propose a novel trust evaluation model for resilient mobile P2P networks, which helps the networks operate normally with high probability. The rest of the paper is organized as follows. Section 2 describes related work. Section 3 presents the proposed trust model. Section 4 contains the experimental study. Finally, we conclude this paper in Section 5.
2 Related Work
The EigenTrust model [2] is designed for reputation management in P2P systems. The global reputation of peer i is determined by the local trust values assigned to peer i by other peers, which reflect the experience of other peers with it. The core of the model is a special normalization process in which the trust ratings held by a peer are normalized so that their sum equals 1. The shortcoming is that the normalization can cause the loss of important trust information. Runfang Zhou and Kai Hwang [3] observed a power-law distribution in user feedbacks and proposed a computational model, PowerTrust, to leverage the power-law feedback characteristics. The paper uses a trust overlay network (TON) to model the trust relationships among peers. PowerTrust can greatly improve global reputation accuracy and aggregation speed, but it cannot avoid the communication overhead of global trust computation. A new trust model based on recommendation evidence is proposed for P2P networks by Tian et al. [4]. The proposed model has advantages in modeling dynamic trust relationships and aggregating recommendation information, and it filters out noisy recommendation information. Thomas Repantis and Vana Kalogeraki [5] propose a decentralized, reputation-based trust management middleware for ad-hoc peer-to-peer networks. In this work, the middleware’s protocols take advantage of the unstructured nature of the network to deal with malicious behavior, and the reputation information of each peer is stored in its neighbors and piggy-backed on its replies. Recently, there have been many approaches studying trust management in wireless networks. The significant efforts so far manage trust with the help of a Certificate Authority (CA) or Key Distribution Center (KDC). A CA/KDC is responsible for setting up the foremost trust relationships among all the nodes by distributing keys or certificates [6]. However, this strategy suffers from the difficulty of collecting t certificates efficiently. In the distributed CA scheme [7], Kong et al. mention that the trust between a to-be-member node and t member nodes in its neighborhood can be established by out-of-band physical proofs, such as human perception or biometrics. However, this method is far from practical.
3 Trust Evaluation Model
Our trust model has two types of trust: direct trust and recommendation trust. Direct trust is the trust of a peer in another peer based on direct interaction experience and is used to evaluate trustworthiness when a peer has enough interaction experience with
another peer. On the other hand, recommendation trust is used when a peer has little interaction experience with another one. Recommendation trust is the trust of a peer in another peer based on direct trust and other peers’ recommendations. In this section we first introduce five trust factors which influence trust in such a mobile environment. We then present the details of how to evaluate the trustworthiness of peers using these trust factors.
3.1 The Trust Factors
Communication of the P2P application: This factor contains communication ratio information. When a peer detects a certain event, if its neighbor peers also detect the same event and broadcast the results of the event, the communication ratio values for those neighbor peers go up. If they do not communicate, the communication ratio values for those peers go down. This factor represents the level of selfishness and normality of a peer. If a peer continuously does not participate in communication in the network, to save its battery or because of some other trouble, its trust value will be degraded.
V_i = \frac{vs_i − vf_i}{vs_i + vf_i}    (1)

V_i: communication value of peer i, where 1 ≤ i ≤ k.
vs_i: communication success count of peer i.
vf_i: communication failure count of peer i.
Communicating results: This factor represents the result information of detected events. It consists of the communicated data and the communication time for the events. The information of this factor is used to check the consistency of each mobile peer and to find malicious peers in the network. The inconsistency check result affects the value of the consistency factor C_i. When peer j checks the consistency of peer i’s communicated results, if the results are outside peer j’s relatively standard bounds, peer j judges the results to be inconsistent or deceitful data. Such a judgment about peer i affects the value of the consistency factor C_i.
R_i = ⟨cd_i, ct_i⟩    (2)

R_i: communicating result value of peer i, where 1 ≤ i ≤ k.
cd_i: communicated data of peer i.
ct_i: communication time of peer i.
Consistency: This factor represents the level of consistency of a peer. Based on this factor, we can identify malicious peers and filter out their communicated data in the network.

C_i = \frac{cc_i − ic_i}{cc_i + ic_i}    (3)
C_i: consistency value of peer i, where 1 ≤ i ≤ k.
cc_i: consistent communication count of peer i.
ic_i: inconsistent communication count of peer i.
Power: This factor represents the remaining lifetime of a mobile peer. Because we compute trust values taking this factor into consideration, we can avoid additional processes that would otherwise be necessary to handle power-management policies. In addition, peers with high trust values are likely to process more jobs than peers with low trust values; in that case, the higher the trust value a peer has, the earlier the peer meets its end. By adopting this power factor, we can prevent such biased power exhaustion.

−1 ≤ P_i ≤ 1    (4)
P_i: power value of peer i, where 1 ≤ i ≤ k.
Size of interactions: “Size” has different meanings in different mobile P2P environments. For example, in a mobile P2P file sharing network, the size of an interaction expresses the file size shared in that interaction. The size of interactions is an important factor that should be considered in the trust model. For peers without any interaction history, most previous trust models define a default level of trust. But if it is set too low, it becomes more difficult for a peer to show trustworthiness through its actions; if it is set very high, there may be a need to limit the possibility for peers to start over by re-registration after misbehaving. In our trust model, the introduction of the size of interactions effectively solves the trust problem of peers without any interaction history.
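For illustration, the following minimal Python sketch shows how the per-factor values of Eqs. (1), (3) and (4) could be maintained from simple counters; the counter names, the linear mapping of the battery level onto [−1, 1] and the example numbers are our own assumptions, not taken from the paper.

    # Illustrative computation of the trust factors of Eqs. (1), (3) and (4).
    def communication_value(vs, vf):
        # Eq. (1): V_i = (vs_i - vf_i) / (vs_i + vf_i), in [-1, 1].
        total = vs + vf
        return (vs - vf) / total if total else 0.0

    def consistency_value(cc, ic):
        # Eq. (3): C_i = (cc_i - ic_i) / (cc_i + ic_i), in [-1, 1].
        total = cc + ic
        return (cc - ic) / total if total else 0.0

    def power_value(remaining_fraction):
        # Eq. (4) only requires -1 <= P_i <= 1; here we assume a linear mapping
        # of the remaining battery fraction (0..1) onto that interval.
        return 2.0 * remaining_fraction - 1.0

    if __name__ == '__main__':
        print(communication_value(vs=45, vf=5))    # 0.8
        print(consistency_value(cc=30, ic=10))     # 0.5
        print(power_value(0.75))                   # 0.5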
3.2 The Computational Model
Consider the situation where peer i wants to interact with peer j in order to accomplish a certain task. There are two ways to calculate a trust value: direct and recommendation. Direct trust is denoted as D(T_i(j), S), where T_i(j) is the direct trust value that peer i calculates for peer j, and S expresses peer j’s level of interaction size, which is granted by peer i. The level of interaction size satisfies the following rules: (1) the lowest level is given to a new peer that does not have any interaction history; (2) a level is upgraded if the number of successful interactions reaches the predefined number for that level; the predefined number is decided by the peer itself, and the lower the current level is, the more successful interactions are needed; (3) the predefined number of successful interactions for a level is increased if interactions fail due to malicious activities. Direct trust computation involves an assignment of weights to the trust factors. We define W_i as a weight which represents the importance of a particular factor, from 0 (unimportant) to +1 (most important). The weight is dynamic and dependent on the application. If P_j ≠ −1, the direct trust value that peer i calculates for peer j is computed by the following equation:
T_i(j) = \frac{W_1 C_j + W_2 V_j + W_3 P_j}{\sum_{j=1}^{3} W_j} + pen(m) · \frac{1}{1 + e^{−n}}    (5)
where 0 < W_j ≤ 1. In the case of P_j = −1, we simply assign −1 to T_i(j) and exclude the node from the network, because it cannot work in the network at all. Because each mobile peer uses histograms for accumulative trust evaluation, malicious or compromised peers that continuously broadcast inconsistent or deceitful data can be found and classified in this trust computing process. pen(m) denotes the punishment function.
\frac{1}{1 + e^{−n}} is the acceleration factor, where n denotes the number of failures. It can make the trust value drop fast when an interaction fails. When two peers have little interaction experience, other peers’ recommendations are needed for trust establishment. Recommendation trust is the trust of a peer in another peer based on direct trust and other peers’ recommendations. Let us assume that peer j requests an interaction with peer i and the size of the interaction is Q. First, peer i computes peer j’s direct trust, denoted as D(T_i(j), S). (1) If Q ≤ S and T_i(j) reaches a certain value (which is set by peer i), peer i considers peer j to be trustworthy and decides to interact with peer j. (2) If Q ≤ S but T_i(j) fails to reach that value, peer i chooses to join a group based on its interest. It then checks its own group and location with GPS, floods a HELLO message containing a packet to announce itself to other peers using the Echo protocol [8], and requests all other members of the group to cast a vote for peer j from the perspective of trust at level Q. For any new peer without any interaction history, the trust value is 0 and the lowest level of interaction size is granted; without requesting votes, such a peer is permitted to interact at the lowest level. (3) If Q > S and T_i(j) fails to reach a certain value, peer i immediately refuses to interact with peer j. (4) If Q > S but T_i(j) reaches a very high value, peer i chooses to join a group based on its interest and then requests all other members of the group to cast a vote for peer j from the perspective of trust at level Q. Second, after the other peers receive the poll request message, they decide whether to cast a vote based on the following formula. Let e denote a voting peer; then
DT_e(j) = \sum_{m=1}^{N(j)} \left( \frac{W_1 C_j + W_2 V_j + W_3 P_j}{\sum_{j=1}^{3} W_j} + pen(m) · \frac{1}{1 + e^{−n}} \right)    (6)
A Novel Trust Evaluation Model for Mobile P2P Networks
215
N ( w)
T=
∑ R(w) × p
(7)
i =1
N (w)
where N ( w) denotes the total number of votes and R ( w) denotes peer w ’s vote accuracy factor which is in the range of (0, 1). p is related to DTw ( j ) such that if DTw ( j ) > 0 , p = 1 , else p = 0 .
4 Experimental Study Experiments have been carried out to study the effectiveness and the benefits of our proposed model. In a real environment, there may exist some vicious attacks including malicious recommendations or cheating in the accumulation of trust in small-size interactions. In addition, it should solve the trust problem when there is no interaction history or little trust value. Table 1. Default parameters in simulation experiments
Number of Peers
300
Communicating range (m)
70
2
×
Simulation area (m )
500 500
Number of malicious Peers
0% - 70% of all peers
Risk attitude
averse, neutral, seeking
Communication protocol
802.11
Life time (s)
[50,100]
Maximum speed (m/s)
20
The simulation environment is set up as follows: we create 300 peers that will perform interacting in a mobile p2p resource sharing system. 300 mobile peers are uniformly distributed at the area whose size is 500m × 500m . Communicating range of a mobile device is 70m. The simulated experiments were run on a dual-processor Dell server and the operation system installed on this machine is Linux with kernel 2.6.9. To make our simulation as close to the real mobile p2p systems where peers often go offline, we simulate the offline peers by assigning every peer a random lifetime (or Time-To-Live) within the step range [50, 100]. After reaching the lifetime, the peer will not respond to any service request, and won’t be counted in the statistics either. After one more step, the peer comes alive again with a new life time randomly chosen from the range [50, 100]. In this analysis, we assume that all mobile peers have a same amount of battery power and participate in communication positively regardless of their roles. Each peer acts as both client and server to share its resources with other peers, and communicates with each other via IEEE 802.11. The default parameters in simulation experiments are showed in the table 1.
216
Fig. 1. Trust evaluation results: panels (a) and (b) plot the recommendation trust against the wrong recommendation number
In the first experiment we evaluate the trust evaluation model in terms of its efficiency in excluding malicious recommendations from the network. We implement and simulate a file sharing system. The environment of the system is as follows: 300 mobile peers are uniformly distributed over an area of 500m × 500m, and the communicating range of a mobile device is 70m. In this analysis, we assume that all mobile peers have the same amount of battery power and participate in communication positively regardless of their roles. So, we consider only the consistency evaluation
factor. Fig. 1 shows the simulation result, in which the broken line denotes the recommendation trust value Tm that includes malicious peers’ recommendations and the solid line denotes the real recommendation trust value Tr that does not include any malicious recommendations. In this simulation, the same malicious recommendation event occurs every 10 seconds. As we can see in Fig. 1(a), the normal recommendation trust value is 0.3, but the malicious recommendation result is 0.9, caused by a few malicious peers which broadcast results three times as high as a normal recommendation result. This indicates the vulnerability of a system without a trust evaluation scheme. Fig. 1(b) shows the process of filtering out the inconsistent data of a malicious node which starts to act inconsistently after a certain number of seconds, with the proposed trust evaluation scheme in place. We can see that Tm fluctuates around Tr, but the scale of the fluctuation is very small. The earlier the system detects a malicious node, the less its malicious recommendations can affect the aggregated result.
Fig. 2. Simulation results of peers under independent cheat and group cheat
In the second experiment, we assess the performance of our mechanism under two attack models: independent cheat and group cheat. Our experiment also shows that the trust model is sensitive to group cheat. In the experiment, we add a number of malicious peers to the network such that malicious peers make up between 0% and 70% of all peers in the network. Fig. 2 shows what happens. In this figure, we compare independent cheat and group cheat. Under independent cheat, a malicious peer first accumulates trust values through small interactions, gaining relatively high trust. After being trusted by most adjacent peers, the peer takes advantage of its high trust value to attack other peers, which means it always provides an inauthentic file to another peer when selected as the download source. Group cheat means that
there is a group in which the peers of the group provide authentic files to each other and provide inauthentic files to peers outside the group. The rate of inauthentic downloads under independent cheat or group cheat increases at the beginning, then starts to drop when the number of malicious peers reaches 30%-40% of all peers in the network. The reason is that the trust computing mechanism used in our experiments punishes this behavior by lowering the trust values quickly, and malicious peers found by the mechanism lose the chance of being selected as download sources. As a result, the rate of inauthentic downloads drops. However, due to the good ratings coming from the cheating group, the rate of inauthentic downloads under group cheat drops more slowly than under independent cheat. Yet one thing remains assured: the rate under group cheat is still dropping and will drop to 5%. Even if no malicious peers are present in the system, downloads are evaluated as inauthentic in 3%-5% of all cases – this accounts for mistakes users make when creating and sharing a file, e.g., by providing the wrong meta-data or creating and sharing an unreadable file.
5 Conclusion and Future Work
The realization of a trust mechanism in mobile P2P networks is quite different due to some characteristics of the mobile environment, which means that trust between participants cannot simply be set up on top of traditional trust mechanisms. In this paper we proposed a novel trust evaluation model for mobile P2P networks. The main factors that influence trust in mobile P2P networks are identified. Our model does not employ cryptographic approaches or certification mechanisms, so it is light enough to fit mobile P2P networks without great overhead. To the best of our knowledge, our approach is one of the earliest pieces of research on trust evaluation models for mobile P2P networks that can detect malicious and compromised mobile peers. In addition, the proposed model effectively solves the trust problem of peers without any interaction history. We expect that our trust evaluation model can help to build resilient mobile P2P networks. In the near future, we would like to test our trust model in more real mobile P2P systems and analyze the system performance.
References
1. Takeshita, K., Sasabe, M., Nakano, H.: Mobile P2P Networks for Highly Dynamic Environments. In: 6th IEEE International Conference on Pervasive Computing and Communications, Hong Kong, pp. 453–457 (2008)
2. Kamvar, S.D., Schlosser, M.T., Molina, H.G.: The EigenTrust Algorithm for Reputation Management in P2P Networks. In: 12th International Conference on World Wide Web, Budapest, Hungary, pp. 640–651 (2003)
3. Zhou, R., Hwang, K.: PowerTrust: A Robust and Scalable Reputation System for Trusted P2P Computing. IEEE Transactions on Parallel and Distributed Systems 18(5) (2007)
4. Tian, C.Q., Zou, S.H., Wang, W.D., Cheng, S.D.: A New Trust Model Based on Recommendation Evidence for P2P Networks. Chinese Journal of Computers 31(2), 271–281 (2008)
5. Thomas, R., Vana, K.: Decentralized Trust Management for Ad-hoc Peer-to-Peer Networks. In: 4th International Workshop on Middleware for Pervasive and Ad-Hoc Computing, Melbourne, Australia (2006)
6. Zhou, L., Haas, Z.J.: Securing Ad Hoc Networks. IEEE Network (Special Issue on Network Security) 13(6), 24–30 (1999)
7. Kong, J., Zerfos, P., Luo, H., Lu, S., Zhang, L.: Providing Robust and Ubiquitous Security Support for Mobile Ad-hoc Networks. In: 9th International Conference on Network Protocols, pp. 251–260 (2001)
8. Sastry, N., Shankar, U., Wagner, D.: Secure Verification of Location Claims. In: 2nd ACM Workshop on Wireless Security, New York, pp. 1–10 (2003)
Evaluating and Optimizing I/O Virtualization in Kernel-based Virtual Machine (KVM)
Binbin Zhang1, Xiaolin Wang1, Rongfeng Lai1, Liang Yang1, Zhenlin Wang2, Yingwei Luo1, and Xiaoming Li1
1 Dept. of Computer Science and Technology, Peking University, Beijing, China, 100871
2 Dept. of Computer Science, Michigan Technological University, Houghton, USA
{wxl,lyw}@pku.edu.cn, [email protected]
Abstract. I/O virtualization performance is an important problem in KVM. In this paper, we evaluate KVM I/O performance and propose several optimizations for improvement. First, we reduce VM Exits by merging successive I/O instructions and decreasing the frequency of timer interrupts. Second, we simplify the Guest OS by removing operations that are redundant when the OS runs in a virtual environment: we eliminate operations that are useless in the virtual environment and bypass I/O scheduling in the Guest OS, whose requests will be rescheduled in the Host OS anyway. We also change the NIC driver's configuration in the Guest OS to adapt to the virtual environment for better performance.
Keywords: Virtualization, KVM, I/O Virtualization, Optimization.
1 Introduction
Software emulation is used as the key technique for I/O device virtualization in the Kernel-based Virtual Machine (KVM). KVM uses a kernel module to intercept I/O requests from a Guest OS and passes them to QEMU, an emulator running in user space on the Host OS. QEMU translates these requests into system calls to the Host OS, which accesses the physical devices via device drivers. This implementation of the VMM is simple, but its performance is usually not satisfactory, because multiple environments are involved in each I/O operation, which results in multiple context switches and long scheduling latency. In recent versions, KVM tries to reduce the I/O virtualization overhead by emulating key devices in the KVM kernel module. However, the main I/O devices are still emulated by QEMU. In this paper, we evaluate KVM disk and network virtualization overhead and try to optimize it by reducing the overhead of VM Exits1 and simplifying the corresponding virtualization operations in the Guest OS. The rest of the paper is organized as follows. Section 2 evaluates disk and network performance in a KVM guest. Section 3 presents our optimizations and evaluations. Section 4 discusses related work, and Section 5 concludes the paper. 1
A VM Exit is a context switch from Guest OS to VMM. When a sensitive instruction is executed in the Guest OS, a VM Exit will happen. After the VMM emulates the sensitive instruction, the Guest OS can continue to work.
2 KVM I/O Performance Evaluation
A software emulation-based I/O virtualization technique, which is used by KVM and most other host-based VMMs, causes significant performance overhead. To analyze the sources of the overhead, we begin with a series of experimental evaluations of KVM's disk and network I/O virtualization performance. We use bonnie++ [6] to evaluate the performance of disk I/O operations, such as character/block read and write and random/sequential read and write. We use netperf [7] to evaluate network I/O performance, including data throughput, latency, and CPU utilization during sending and receiving data over TCP and UDP. Finally, we use SPECjbb [8] to emulate a data warehouse's I/O workload.
2.1 Disk I/O Performance
Test-bed: Intel Core 2 Quad Q9550, 2.83GHz, 4GB RAM, 500GB SATA disk, Linux 2.6.27.7, KVM-76 with default configuration; the virtual machine uses a raw disk image.
Results: We run bonnie++, which evaluates the file system calls in the Guest OS. Figure 1 illustrates the throughput of the sequential block write benchmark, as well as the CPU utilization rate during the experiments, on the virtual machine and the native machine.
[Figure 1: "Sequential Block Writes" - throughput (KB/s, left axis) and CPU utilization (%, right axis) plotted against block sizes from 256B to 1M, for Native and KVM.]
Fig. 1. Comparison of the performance between Native and KVM - Sequential block writes
From Figure 1, it can be observed that the throughput and CPU utilization of sequential block writes on the virtual machine are only about half of those on the native machine. Figure 2 illustrates the throughput of the sequential block read benchmark, as well as the CPU utilization rate during the experiments, on the virtual machine and the native machine.
Fig. 2. Comparison of the performance between Native and KVM – Sequential block reads
It can be observed that the throughput of sequential block reads on the virtual machine is very close to that of the native machine. However, when the disk block size is less than 4K, the CPU utilization of the virtual machine is much higher than that of the native machine. The reason is that when bonnie++ is reading the disk, the data is frequently hit in the disk cache, so the actual disk is seldom accessed during such experiments. From the experimental data shown in Fig. 1 and Fig. 2, it can be deduced that the key to achieving better disk I/O performance is to improve write throughput and to reduce the CPU overhead during disk reads.
2.2 Network Performance
Test-bed: Two physical machines, HostA and HostB, both using Gigabit Ethernet. HostA works with an 82566DC NIC and HostB works with an 82567LM-2 NIC. One virtual machine (named GuestB) runs on HostB with KVM-76.
Ping-pong benchmarking: The Round Trip Times (RTT) between HostA and HostB, between HostB and GuestB, and between HostA and GuestB are measured and illustrated in Figure 3. Based on Figure 3, we are able to estimate the additional network overhead brought by virtualization. The packet transmission time from HostA to GuestB is 0.378 ms, which is approximately the sum of the transmission time from HostA to HostB (0.234 ms) and the transmission time from HostB to GuestB (0.126 ms). The same holds in the reverse direction, from GuestB to HostA. This means the transmission time between HostB and GuestB can be considered the virtualization overhead. More precisely, this overhead accounts for 33% of the transmission time between HostA and GuestB.
[Figure 3: "Ping-pong Benchmarking" - RTT (ms) for pings between HostA, HostB, and GuestB in both directions.]
Fig. 3. Ping-pong benchmarking results
Netperf Tests: The throughput and latency between the client and the server are measured using Netperf and illustrated in Figure 4. Experiments: 1. Run the netperf server on remote HostA. 2. HostB tests using the netperf client. 3. GuestB tests using the netperf client, with various configurations, e.g., different virtual NICs and different drivers. Figure 4 shows the results.
[Figure 4: normalized results for TCP Stream (Mb/s), UDP Stream (Mb/s), TCP RR (Trans/s), and UDP RR (Trans/s) under Native(e1000e), KVM(e1000), KVM(e1000+NAPI), and KVM(rtl8139).]
Fig. 4. Netperf benchmark test, Guest OS used different NIC configurations - virtual e1000 NIC (e1000 driver, e1000 driver with NAPI support), virtual rtl8139 NIC
According to the conclusion of Experiment 1 (the ping-pong benchmarking), the throughput of the virtual machine should be about 2/3 of that of the native machine, i.e., approximately 600 Mb/s, rather than the 400 Mb/s measured. This gap indicates that data processing in the VMM and the Host OS may introduce additional overhead.
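One way to make the 2/3 estimate explicit (this is our reading of the ping-pong numbers above, not a calculation spelled out by the authors) is to assume throughput scales inversely with the end-to-end latency:

    expected KVM throughput ≈ Native × RTT(HostA-HostB) / RTT(HostA-GuestB)
                            = Native × 0.234 / 0.378 ≈ 0.62 × Native ≈ 2/3 × Native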
From Figure 4, it can also be observed that the virtual device capability greatly influences network performance in a virtual environment.
2.3 Instruction Level Evaluation
To further analyze the reasons for this performance degradation, we intercept the operations issued by the Guest OS that result in VM Exits during the SPECjbb and Netperf tests. The top 5 functions, called the hot functions, which cause the most VM Exits, are listed in Table 1 and Table 2. It can be observed that I/O operations have a certain hot-code effect, that is, a small number of instructions cause most of the VM Exits. We can therefore optimize these hot functions to decrease the virtualization overhead.
Table 1. Hot functions during SPECjbb test on KVM VM
Trap function: ioread16, iowrite16, ioread32, ioread8, _spin_lock_irqrestore
Additionally, in the SPECjbb test, the timer interrupt routine is one of the top 5 time consumers; therefore, optimizing timer interrupts will improve SPECjbb performance. We also intercept hot functions in the Bonnie++ benchmark, and the top 5 hot functions are shown in Table 3. Because these functions are called frequently from many places, we further intercepted the caller functions.
Table 3. Hot functions during Bonnie++ test on KVM VM
Trap function    Caller function address    Caller function
acpi_pm_read     0xc06442e7                 verify_pmtmr_rate
iowrite8         0xc05e8fe3                 ata_sff_dev_select
ioread8          0xc05eaa44                 ata_bmdma_status
iowrite32        0xc05ea9c2                 ata_bmdma_setup
We observe that the top caller function, verify_pmtmr_rate, is used to read the clock on the motherboard. It is redundant in the virtual environment and thus can be eliminated.
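A minimal sketch of this kind of exit-site profiling (our illustration; the hook name and table layout are assumptions, not KVM's actual tracing interface) is to count VM Exits per guest instruction pointer inside the exit handler and dump the largest counters afterwards:

/* Illustrative only: tally VM Exits by the guest instruction pointer that
 * triggered them, so frequently exiting ("hot") code can be identified. */
#define PROFILE_SLOTS 4096

struct exit_counter {
    unsigned long guest_rip;  /* guest instruction pointer at the exit */
    unsigned long count;      /* number of exits observed from it      */
};

static struct exit_counter exit_table[PROFILE_SLOTS];

static void record_vm_exit(unsigned long guest_rip)
{
    unsigned long slot = (guest_rip >> 4) % PROFILE_SLOTS;  /* crude hash */

    if (exit_table[slot].guest_rip != guest_rip) {
        /* overwrite on collision; good enough for spotting hot functions */
        exit_table[slot].guest_rip = guest_rip;
        exit_table[slot].count = 0;
    }
    exit_table[slot].count++;
}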
3 Optimization and Evaluation
Based on the experimental results in Section 2, we focus on reducing context switches and simplifying the Guest OS to optimize KVM I/O virtualization. We discuss our optimization methods in this section.
3.1 Reducing the Context Switching
There are multiple occasions for context switching in each I/O operation in KVM, including switches among the Guest OS, KVM, and the Host OS. Most switches are caused by the KVM architecture, but the switches between the Guest OS and KVM depend on the behavior of the Guest OS, which may in turn lead to other context switches. If we can modify the Guest OS's behavior and remove some VM Exits, the number of context switches is reduced. From the instruction-level results in Section 2.3, it can be observed that I/O operations in the Guest OS are clustered to a certain extent: a few I/O instructions cause a large number of VM Exits, and these can be optimized further.
3.1.1 Merging Successive I/O Instructions
When an I/O instruction is executed in a Guest OS, it causes a VM Exit to KVM. In the Guest OS disk driver, some code fragments contain successive I/O instructions, which cause multiple consecutive VM Exits. If we merge these instructions into a single operation, only a single VM Exit is needed to handle all of them. We merge the successive I/O instructions into a single vmcall, which exits to KVM actively. The method is to put the information of each instruction into a queue, including IN/OUT, the port number, and the value. The address and the length of the queue are passed to KVM as the parameters of the vmcall. KVM then reads each instruction from the queue and emulates them one by one. For example, Figure 5 is a code fragment from the function __ide_do_rw_disk in the disk driver code (/driver/ide/ide-disk.c).
/driver/ide/ide-disk.c
hwif->OUTB(tasklets[1], IDE_FEATURE_REG);
hwif->OUTB(tasklets[3], IDE_NSECTOR_REG);
hwif->OUTB(tasklets[7], IDE_SECTOR_REG);
hwif->OUTB(tasklets[8], IDE_LCYL_REG);
hwif->OUTB(tasklets[9], IDE_HCYL_REG);
hwif->OUTB(tasklets[0], IDE_FEATURE_REG);
hwif->OUTB(tasklets[2], IDE_NSECTOR_REG);
hwif->OUTB(tasklets[4], IDE_SECTOR_REG);
hwif->OUTB(tasklets[5], IDE_LCYL_REG);
hwif->OUTB(tasklets[6], IDE_HCYL_REG);
hwif->OUTB(0x00|drive->select.all, IDE_SELECT_REG);
Fig. 5. A code fragment including successive I/O instructions
These 11 I/O instructions are executed consecutively and yield 11 VM Exits. Our approach merges them into one vmcall as follows; the fragment after replacement is illustrated in Figure 6.
/driver/ide/ide-disk.c
struct io_insn io_out[11]; // the queue to store instruction information
unsigned long io_gpa, io_len;
#define IO_OUT(x, _type, _val, _port) \
    io_out[x].type = _type, \
    io_out[x].port = _port, \
    io_out[x].val = _val
// put the information of the I/O instructions into the queue:
IO_OUT(0, OUTB, tasklets[1], IDE_FEATURE_REG);
IO_OUT(1, OUTB, tasklets[3], IDE_NSECTOR_REG);
IO_OUT(2, OUTB, tasklets[7], IDE_SECTOR_REG);
IO_OUT(3, OUTB, tasklets[8], IDE_LCYL_REG);
IO_OUT(4, OUTB, tasklets[9], IDE_HCYL_REG);
IO_OUT(5, OUTB, tasklets[0], IDE_FEATURE_REG);
IO_OUT(6, OUTB, tasklets[2], IDE_NSECTOR_REG);
IO_OUT(7, OUTB, tasklets[4], IDE_SECTOR_REG);
IO_OUT(8, OUTB, tasklets[5], IDE_LCYL_REG);
IO_OUT(9, OUTB, tasklets[6], IDE_HCYL_REG);
IO_OUT(10, OUTB, 0x00|drive->select.all, IDE_SELECT_REG);
// the address space differs between KVM and the Guest OS, so translate the io_insn address into a physical address:
io_gpa = virt_to_phys((unsigned long)io_out);
io_len = 11; // the number of I/O instructions we have replaced
vmcall(XKVM_IO_COALESCE, io_gpa, io_len, 0); // vmcall to exit into KVM; the parameters include the address and the length of io_insn
Fig. 6. The code fragment after the successive I/O instructions are replaced with a vmcall
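On the hypervisor side, the paper only states that KVM reads the queue and emulates the entries one by one. A sketch of such a handler might look as follows (our illustration; MAX_COALESCED_IO and the helpers copy_io_queue_from_guest and emulate_pio_one are placeholders, not actual KVM functions):

#include <linux/kvm_host.h>

#define MAX_COALESCED_IO 32  /* assumed upper bound on queued instructions */

struct io_insn {   /* mirrors the guest-side queue entry (type, port, val) */
    u32 type;
    u32 port;
    u32 val;
};

/* Placeholder helpers - not actual KVM functions. */
static int copy_io_queue_from_guest(struct kvm_vcpu *vcpu, struct io_insn *dst,
                                    u64 gpa, u64 len);
static void emulate_pio_one(struct kvm_vcpu *vcpu, u32 type, u32 port, u32 val);

static int handle_io_coalesce(struct kvm_vcpu *vcpu, u64 gpa, u64 count)
{
    struct io_insn q[MAX_COALESCED_IO];
    u64 i;

    if (count > MAX_COALESCED_IO)
        return -EINVAL;
    /* copy the guest's instruction queue out of guest physical memory */
    if (copy_io_queue_from_guest(vcpu, q, gpa, count * sizeof(q[0])))
        return -EFAULT;
    /* emulate the queued port accesses back to back: one VM Exit instead
     * of one exit per instruction */
    for (i = 0; i < count; i++)
        emulate_pio_one(vcpu, q[i].type, q[i].port, q[i].val);
    return 0;
}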
We have modified two fragments that include successive port I/O instructions, and the number of VM Exits caused by I/O instructions is reduced. We count the number of VM Exits caused by inb and outb; the results are shown in Table 4.
Table 4. The number of VM Exits caused by inb and outb
Instruction address    Before modification    After modification
0xc054fcc0             111050                 17226
0xc054ecf0             15864                  2343
This method is essentially borrowed from para-virtualization [5], which reduces context switches between the Guest OS and the VMM by modifying the Guest OS. We currently only implement static modification. In the future we will try to implement dynamic
modification, which replaces the code fragments on the fly while the Guest OS is running. One method is to monitor Guest OS execution, locate the fragments that produce many I/O-related VM Exits, merge the successive I/O instructions into a new fragment, and then insert jump instructions to replace the old fragment. Another method is to prefetch the following instructions after a VM Exit occurs: if more I/O instructions follow the current one, emulate them all at once before switching back to the Guest OS. We evaluate the performance of this optimization. Unfortunately, the result is not promising: the CPU overhead is somewhat reduced, but I/O throughput is nearly the same. This may be because we only reduce the overhead brought by context switches, while the work performed after each VM Exit is unchanged. Therefore the cost of VM Exit and VM Entry is not the main source of the performance overhead.
3.1.2 Reducing the Timer Interrupts
Timer interrupts are another cause of VM Exits: when a timer interrupt happens, a VM Exit occurs. For non-real-time applications, we can reduce the timer frequency to reduce VM Exits. KVM emulates a PIT (Programmable Interval Timer) for each Guest. The PIT can trigger timer interrupts at a programmed frequency. A PIT contains an oscillator that produces a clock signal at a frequency of (roughly) 1193182 HZ. Each clock signal decrements the counter in PIT channel 0 by 1; when the counter reaches 0, the PIT generates a timer interrupt. The virtual PIT works in a similar way, so if we modify the initial value of the counter in PIT channel 0, the timer frequency is modified. For example, if the counter in PIT channel 0 is initialized to 1193, the timer frequency is 1000 HZ (1193182/1193); if the counter is initialized to 11932, the timer frequency is 100 HZ. If the value initialized by the Guest OS is modified, the timer frequency changes transparently to the Guest OS. The implementation is simple: we only need to modify the pit_load_count function. We decrease the Guest's actual timer frequency to 100 HZ and compare its I/O performance against the Guest running at the default 1000 HZ.
[Figure 7: performance with the timer at 1000 HZ (vHZ1000) and 100 HZ (vHZ100), normalized - (a) Bonnie++ (putc, write, rewrite, getc, sequential create, random create) and (b) SPECjbb (1-8 warehouses and the average).]
Fig. 7. Timer Interrupt Evaluation
It can be observed in Figure 7 that the lower timer frequency yields up to a 15% performance improvement.
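A minimal sketch of the idea (illustrative only; the function and constant names are ours, kernel-style u32/u64 integer types are assumed, and the real pit_load_count involves more bookkeeping): scale the reload value the Guest writes to PIT channel 0 so that a 1000 HZ request becomes an effective 100 HZ.

#define PIT_FREQ_DIVISOR 10  /* assumed: turn a 1000 HZ guest timer into 100 HZ */

/* Scale the counter value the Guest programs for PIT channel 0 before it is
 * loaded into the virtual PIT, lowering the interrupt rate transparently. */
static u32 scale_pit_reload(u32 guest_count)
{
    u64 scaled = (u64)guest_count * PIT_FREQ_DIVISOR;

    /* the PIT reload register is 16 bits wide; clamp instead of overflowing */
    if (scaled > 0xFFFF)
        scaled = 0xFFFF;
    return (u32)scaled;
}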
3.2 Simplifying Guest OS
A virtual environment is different from a native one, and some Guest OS operations become redundant when the OS runs in a virtual environment. If these redundant operations are removed, the Guest OS becomes more efficient.
3.2.1 Removing the Redundant Operations
We try to locate and remove redundant operations in a Guest OS. An example is verify_pmtmr_rate, the function that causes the most VM Exits while bonnie++ is running (see Section 2.3). It is used to adjust the timer on the motherboard, which is useless in a virtual environment. We modify the Guest OS by simply removing this function. The result is illustrated in Figure 8: the write performance is improved across all block sizes.
[Figure 8: "Block Write Throughput" - throughput (KB/s) vs. block size (256B to 512K), before and after the change.]
Fig. 8. Write performance (before and after verify_pmtmr_rate removed)
Another redundant operation is I/O scheduling in the Guest OS, because all I/O requests scheduled by the Guest OS are rescheduled in the Host OS anyway; the scheduler in the Guest OS is thus redundant. For one thing, the Guest OS is unaware of the physical disk layout, so the results of its scheduling may not be helpful anyway. For another, when multiple Guest OSes operate simultaneously, the Host OS must reschedule the I/O requests from all of them and try to maximize whole-system performance. We remove the I/O scheduler in the Guest OS, so I/O requests are submitted directly to the drivers. After our optimization, the throughput of block writes on the virtual machine increases by 6.01%, and the throughput of block reads increases by 5.64% (the Host OS uses the CFQ I/O scheduler [12] by default). The disadvantage is that all I/O requests are submitted in FIFO order once I/O scheduling is eliminated, and we cannot set up performance quotas without the I/O scheduler.
3.2.2 Optimizing NIC Driver in Guest OS
NAPI is used to reduce interrupt frequency via interrupt masking. But in a virtual environment, a virtual NIC cannot generate interrupts directly, so the NAPI support in the NIC driver in the Guest OS is useless and time-consuming. We remove
the NAPI support from the NIC driver in the Guest OS and evaluate the performance. The result is illustrated in Figure 9.
[Figure 9: normalized Netperf results (TCP Stream, UDP Stream, TCP RR, UDP RR) for KVM(e1000+NAPI) and KVM(e1000).]
Fig. 9. Netperf performance (with/without NAPI support)
It can be observed that this optimization improves network throughput. In the TCP/UDP Stream and UDP Request/Response tests, the throughput increases by more than 60% after NAPI support is removed from the NIC driver.
4 Related Work
This section discusses related work that focuses on reducing context switching overhead and modifying the Guest OS.
4.1 Reducing the Context Switching Overhead
There are two ways to reduce context switching cost: one is to lower the frequency of such operations, and the other is to reduce the cost of each switch. Submission in batches is one way to reduce the context switch frequency; another is to reduce device interrupts where possible. Sugerman et al. [1] have implemented batch submission in VMware Workstation: when the context switch frequency exceeds a certain threshold, requests are queued in a cache until the next interrupt happens, and then all queued requests are handled at one time. This method also reduces the IRQ transfer cost, because only one IRQ is needed once the batch of requests is complete. Another optimization by Sugerman et al. [1] is to improve the Guest driver protocol. They design an interface suitable for a virtual environment; for example, this interface avoids I/O instructions that access device status, and it reduces the number of virtual IRQs which cause context switching between a VMM and a VM. Virtio (R. Russell [4]) is a general device driver which provides the same operation process for block and network devices. Virtio borrows ideas from para-virtualization and can be used by various VMMs. The implementation maintains buffer rings based on shared memory between a Guest OS and a VMM; one side posts to the rings, while the other consumes them, and an event notification mechanism is
implemented: when buffers are added to the ring, the other side is notified. This notification can be masked to facilitate batching and reduce context switching. Reducing the cost of each switch means simplifying the operations performed during each switch. Sugerman et al. [1] modify the process switch operation in a Guest OS on VMware Workstation: when switching to the idle task, the page table is not reloaded, since the idle task is a kernel thread and can use any process's page table. This optimization cuts the MMU-related virtualization overhead in half. In addition, Sugerman et al. try to optimize VMware Workstation via Host OS bypassing: the VMM accesses hardware devices directly, avoiding context switches between the VMM and the Host OS. This method is employed by full virtualization systems, e.g., VMware ESX Server. Another cost is due to TLB misses and cache misses after context switches. Aravind Menon et al. [2] enhance the Guest OS in Xen to support advanced virtual memory management, e.g., superpages and global page mappings, which greatly reduces the TLB misses caused by context switches.
4.2 Simplifying Guest OS
Ram et al. [3] manage to reduce the overhead in a Guest OS in three aspects. Firstly, they implement LRO (Large Receive Offload), which combines a number of data packets into one large TCP/IP packet so that a large amount of data can be handled in a single pass through the protocol stack. Secondly, they reduce the buffer size to half a page, which shrinks the working set and thereby reduces the TLB miss rate. Menon et al. [2] improve Xen's virtual network interface to support the offload features provided by most NICs (if the hardware does not support offload, the driver domain can simulate it, which also improves performance). Offload functions include scatter/gather I/O, TCP/IP checksum offload, and TCP segmentation offload (TSO). Scatter/gather I/O supports continuous DMA operations over non-contiguous memory, TSO reduces the number of packets to be processed, and checksum offload reduces the Guest OS load. Most other optimizations focus specifically on the architecture of Xen and are not very helpful to other VMMs. A series of hardware assists and specifications support device virtualization at the hardware level, including Intel VT-d [9], AMD IOMMU [10], and PCI-SIG IOV [11]. VT-d and IOMMU are similar: they ensure the isolation of I/O address spaces between different VMs. An I/O MMU, similar to the MMU, is installed on the PCI bridge to translate DMA addresses to machine memory addresses, and an IOTLB accelerates this translation. PCI-SIG IOV includes ATS (Address Translation Services), SR-IOV (Single-Root IOV), and MR-IOV (Multi-Root IOV); these specifications help with I/O address translation and let devices provide multiple interfaces for direct access by multiple VMs. These new technologies and specifications help a guest access hardware devices directly, avoiding virtualization overhead and simplifying the implementation of I/O virtualization. Future research on optimizing I/O virtualization should focus on how to optimize the Guest OS to make it more efficient in the virtual environment. Another direction is to examine how to make the Guest OS, the VMM, and the Host OS work better in coordination with each other.
5 Conclusion
We evaluate KVM I/O performance and propose several methods to optimize it. We reduce VM Exits by merging successive I/O instructions and decreasing the frequency of timer interrupts, and we simplify the Guest OS by removing operations that are redundant in the virtual environment. Guest OS simplification will be an important direction for future research on optimizing VMM performance. We will continue to investigate how to make a Guest OS more efficient in a virtual environment and how to make the Guest OS, the VMM, and the Host OS coordinate with each other better.
Acknowledgement This work is supported by the National Grand Fundamental Research 973 Program of China under Grant No. 2007CB310900, National Science Foundation of China under Grant No.90718028 and No. 60873052, National High Technology Research 863 Program of China under Grant No.2008AA01Z112, and MOE-Intel Information Technology Foundation under Grant No. MOE-INTEL-08-09. Zhenlin Wang is also supported by NSF Career CCF0643664.
References
[1] Sugerman, J., Venkitachalam, G., Lim, B.-H.: Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor. In: Proceedings of the 2001 USENIX Annual Technical Conference (June 2001)
[2] Menon, A., Cox, A.L., Zwaenepoel, W.: Optimizing Network Virtualization in Xen. In: Proceedings of the 2006 USENIX Annual Technical Conference (2006)
[3] Ram, K.K., Santos, J.R., Turner, Y., Cox, A.L., Rixner, S.: Achieving 10 Gb/s Using Safe and Transparent Network Interface Virtualization. In: Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (2009)
[4] Russell, R.: virtio: Towards a De-Facto Standard for Virtual I/O Devices. Operating Systems Review 42(5), 95–103 (2008)
[5] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., Warfield, A.: Xen and the Art of Virtualization. In: Proceedings of the 19th ACM SOSP (2003)
[6] bonnie++, http://www.coker.com.au/bonnie++/
[7] netperf, http://www.netperf.org/netperf/
[8] SPEC JBB (2005), http://www.spec.org/jbb2005/
[9] Intel Corporation: Intel Virtualization Technology for Directed I/O Architecture Specification (2007), http://download.intel.com/technology/computing/vptech/Intelr_VT_for_Direct_IO.pdf
[10] Advanced Micro Devices, Inc.: AMD I/O Virtualization Technology (IOMMU) Specification (2007), http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/34434.pdf
[11] PCI-SIG: I/O Virtualization (2007), http://www.pcisig.com/specifications/iov/
[12] Axboe, J.: Time Sliced CFQ I/O Scheduler, http://kerneltrap.org/node/4406
Distributed Stream Processing with DUP
Kai Christian Bader1, Tilo Eißler1, Nathan Evans1, Chris GauthierDickey2, Christian Grothoff1, Krista Grothoff1, Jeff Keene2, Harald Meier1, Craig Ritzdorf2, and Matthew J. Rutherford2
1 Faculty of Informatics, Technische Universität München
2 Department of Computer Science, University of Denver
Abstract. This paper introduces the DUP System, a simple framework for parallel stream processing. The DUP System enables developers to compose applications from stages written in almost any programming language and to run distributed streaming applications across all POSIX-compatible platforms. Parallel applications written with the DUP System do not suffer from many of the problems that exist in traditional parallel languages. The DUP System includes a range of simple stages that serve as general-purpose building blocks for larger applications. This work describes the DUP Assembly language, the DUP architecture and some of the stages included in the DUP run-time library. We then present our experiences with parallelizing and distributing the ARB project, a package of tools for RNA/DNA sequence database handling and analysis. Keywords: Coordination language, parallel programming, productivity.
1 Introduction
The widespread adoption of multi-core processors and the commoditization of specialized co-processors like GPUs [1] and SPUs [2] require the development of tools and techniques that enable non-specialists to create sophisticated programs that leverage the hardware at their disposal. Mainstream and productive development cannot rely on teams of domain and hardware experts using specialized languages and hand-optimized code, though this style of development will remain applicable to high-performance computing (HPC) applications that demand ultimate performance. This paper introduces the DUP System1, a language system which facilitates productive parallel programming for stream processing on POSIX platforms. It is not the goal of the DUP System to provide ultimate performance; we are instead willing to sacrifice some performance gain for significant benefits in terms of programmer productivity. By providing useful and intuitive abstractions, the DUP System enables programmers without experience in parallel programming 1
Available at http://dupsystem.org/
or networking to develop correct parallel and distributed applications and obtain speed-ups from parallelization. The key idea behind the DUP System is the multi-stream pipeline programming paradigm and the separation of multi-stream pipeline specification and execution from the language(s) used for the main computation. Multi-stream pipelines are a generalization of UNIX pipelines. However, unlike UNIX pipelines, which are composed of processes that read from at most one input stream and write to a single output stream (and possibly an error stream), multi-stream pipelines are composed of processes that can read from any number of input streams and write to any number of output streams. In the remainder of this document, we will use the term "stage" for individual processes in a multi-stream pipeline. Note that UNIX users (even those with only rudimentary programming experience) can usually write correct UNIX pipelines, which are actually parallel programs. By generalizing UNIX pipelines to multi-stream pipelines, we eliminate the main restriction of the UNIX pipeline paradigm, namely the inherently linear data flow. In order to support the developer in the use of multi-stream pipelines, the DUP System includes a simple coordination language which, similar to syntactic constructs in the UNIX shell, allows the user to specify how various stages should be connected with streams. The DUP runtime then sets up the streams and starts the various stages. Key benefits of the DUP System include:
1. Stages in a multi-stream pipeline can run in parallel and on different cores;
2. Stages can be implemented, compiled and tested individually using an appropriate language and compiler for the given problem and architecture;
3. Stages only communicate using streams; streams are a great match for networking applications and for modern processors doing sequential work;
4. If communication between stages is limited to streams, there is no possibility of data races and other issues that plague developers of parallel systems;
5. While the DUP System supports arbitrary data-flow graphs, the possibility of deadlocks can be eliminated by only using acyclic data-flow graphs;
6. Applications built using multi-stream pipelines can themselves be composed into a larger multi-stream pipeline, making it easy for programmers to express hierarchical parallelism.
In addition to introducing the DUP System itself, this paper also presents experimental results from a case study involving the DUP System. The case study shows that it is possible to rapidly parallelize and distribute an existing complex legacy bioinformatics application and obtain significant speed-ups using DUP.
2 Approach
The fundamental goal of multi-stream pipelines is to allow processes to read from multiple input streams and write to multiple output streams, all of which may be connected to produce the desired data-flow graph. This generalization of linear
UNIX pipelines can be implemented using traditional UNIX APIs,2 especially the dup2 system call. Where a typical UNIX shell command invocation only connects stdin, stdout and stderr, the DUP System establishes additional I/O streams before starting a stage. Using this method, traditional UNIX filters (such as grep) can be used as stages in the DUP System without modification. New stages can be implemented in any language environment that supports POSIX-like input-output operations (specifically, reading and writing to a file). Since dup2 also works with TCP sockets, the DUP System furthermore generalizes multi-stream pipelines to distributed multi-stream pipelines.
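For concreteness, the following is a minimal sketch of this mechanism (our own example, not code from the DUP implementation; the stage binary name ./stage and the choice of descriptor 3 are illustrative assumptions): the parent creates a pipe, dup2()s its read end onto a fixed descriptor number, and then execs an unmodified stage that simply reads from that descriptor. A TCP socket could be dup2()'d onto the same descriptor in exactly the same way, which is how the distributed case fits the same model.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int extra[2];

    if (pipe(extra) != 0) {
        perror("pipe");
        return 1;
    }
    pid_t pid = fork();
    if (pid == 0) {                 /* child: becomes the stage            */
        close(extra[1]);            /* keep only the read end              */
        dup2(extra[0], 3);          /* expose it to the stage as fd 3      */
        close(extra[0]);
        execlp("./stage", "stage", (char *)NULL);  /* hypothetical binary  */
        perror("execlp");
        _exit(1);
    }
    close(extra[0]);
    /* parent: feed data into the stage's extra input stream */
    const char msg[] = "hello on fd 3\n";
    write(extra[1], msg, sizeof msg - 1);
    close(extra[1]);
    return 0;
}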
2.1 The DUP Assembly Language
The DUP Assembly language allows developers to specify precisely how to connect stages and where those stages should be run. Figure 1 lists the DUP Assembly code for a distributed “Hello World” example program. s @10.0.0.1[0out.txt] $
fanout; grep Hello; grep World; faninany;
Fig. 1. DUP specification. in.txt is passed to fanout (“0
In essence, the DUP language allows developers to specify a directed graph using an adjacency-list representation and I/O redirection syntax similar to that of well-known UNIX shells [3]. The nodes in the directed graph are the stages initiated by DUP. A DUP program consists of a list of statements, each of which corresponds to one such node. Statements start with a label that is used to reference the respective stage in the specification of other stages. The keyword DUP is used to reference streams associated with the controlling dup command in case the dup command itself is used as a stage. The label is followed by a hostname specifying the system on which the stage will run. A helper process, dupd, is started on the specified host; it listens on a port to establish network connections and eventually supervises the stages run there. The address is followed by a comma-separated list of edges representing primarily the outgoing streams for this stage. Input streams are only explicitly specified in the case of input from files or from the controlling dup command. Inputs from other stages are not specified because they can be inferred from the respective entry of the producing stage. DUP supports four different ways to create streams for a stage: 2
The APIs needed are supported by all platforms conforming to the POSIX standard, including BSD, GNU/Linux, OS X, and z/OS.
<STAGE> <EDGELIST> <EDGE>