Lecture Notes in Computer Science
Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2304
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
R. Nigel Horspool (Ed.)
Compiler Construction
11th International Conference, CC 2002
Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2002
Grenoble, France, April 8-12, 2002
Proceedings
Series Editors

Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editor

R. Nigel Horspool
University of Victoria, Dept. of Computer Science
Victoria, BC, Canada V8W 3P6
E-mail: [email protected]
Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme

Compiler construction : 11th international conference ; proceedings / CC 2002, held as part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2002, Grenoble, France, April 8-12, 2002. R. Nigel Horspool (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Tokyo : Springer, 2002
(Lecture notes in computer science ; Vol. 2304)
ISBN 3-540-43369-4
CR Subject Classification (1998): D.3.4, D.3.1, F.4.2, D.2.6, I.2.2, F.3

ISSN 0302-9743
ISBN 3-540-43369-4 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York, a member of BertelsmannSpringer Science+Business Media GmbH

http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2002
Printed in Germany

Typesetting: Camera-ready by author, data conversion by DA-TeX Gerd Blumenstein
Printed on acid-free paper
SPIN 10846505 06/3142 5 4 3 2 1 0
Foreword
ETAPS 2002 was the fifth instance of the European Joint Conferences on Theory and Practice of Software. ETAPS is an annual federated conference that was established in 1998 by combining a number of existing and new conferences. This year it comprised 5 conferences (FOSSACS, FASE, ESOP, CC, TACAS), 13 satellite workshops (ACL2, AGT, CMCS, COCV, DCC, INT, LDTA, SC, SFEDL, SLAP, SPIN, TPTS, and VISS), 8 invited lectures (not including those specific to the satellite events), and several tutorials. The events that comprise ETAPS address various aspects of the system development process, including specification, design, implementation, analysis, and improvement. The languages, methodologies, and tools which support these activities are all well within its scope. Different blends of theory and practice are represented, with an inclination towards theory with a practical motivation on one hand and soundly-based practice on the other. Many of the issues involved in software design apply to systems in general, including hardware systems, and the emphasis on software is not intended to be exclusive.

ETAPS is a loose confederation in which each event retains its own identity, with a separate program committee and independent proceedings. Its format is open-ended, allowing it to grow and evolve as time goes by. Contributed talks and system demonstrations are in synchronized parallel sessions, with invited lectures in plenary sessions. Two of the invited lectures are reserved for "unifying" talks on topics of interest to the whole range of ETAPS attendees. The aim of cramming all this activity into a single one-week meeting is to create a strong magnet for academic and industrial researchers working on topics within its scope, giving them the opportunity to learn about research in related areas, and thereby to foster new and existing links between work in areas that were formerly addressed in separate meetings.

ETAPS 2002 was organized by the Laboratoire Verimag in cooperation with:

– Centre National de la Recherche Scientifique (CNRS)
– Institut de Mathématiques Appliquées de Grenoble (IMAG)
– Institut National Polytechnique de Grenoble (INPG)
– Université Joseph Fourier (UJF)
– European Association for Theoretical Computer Science (EATCS)
– European Association for Programming Languages and Systems (EAPLS)
– European Association of Software Science and Technology (EASST)
– ACM SIGACT, SIGSOFT, and SIGPLAN
The organizing team comprised:

– Susanne Graf – General Chair
– Saddek Bensalem – Tutorials
– Rachid Echahed – Workshop Chair
– Jean-Claude Fernandez – Organization
– Alain Girault – Publicity
– Yassine Lakhnech – Industrial Relations
– Florence Maraninchi – Budget
– Laurent Mounier – Organization

Overall planning for ETAPS conferences is the responsibility of its Steering Committee, whose current membership is: Egidio Astesiano (Genova), Ed Brinksma (Twente), Pierpaolo Degano (Pisa), Hartmut Ehrig (Berlin), José Fiadeiro (Lisbon), Marie-Claude Gaudel (Paris), Andy Gordon (Microsoft Research, Cambridge), Roberto Gorrieri (Bologna), Susanne Graf (Grenoble), John Hatcliff (Kansas), Görel Hedin (Lund), Furio Honsell (Udine), Nigel Horspool (Victoria), Heinrich Hußmann (Dresden), Joost-Pieter Katoen (Twente), Paul Klint (Amsterdam), Daniel Le Métayer (Trusted Logic, Versailles), Ugo Montanari (Pisa), Mogens Nielsen (Aarhus), Hanne Riis Nielson (Copenhagen), Mauro Pezzè (Milan), Andreas Podelski (Saarbrücken), Don Sannella (Edinburgh), Andrzej Tarlecki (Warsaw), Herbert Weber (Berlin), Reinhard Wilhelm (Saarbrücken).

I would like to express my sincere gratitude to all of these people and organizations, the program committee chairs and PC members of the ETAPS conferences, the organizers of the satellite events, the speakers themselves, and finally Springer-Verlag for agreeing to publish the ETAPS proceedings. As organizer of ETAPS'98, I know that there is one person who deserves special applause: Susanne Graf. Her energy and organizational skills have more than compensated for my slow start in stepping into Don Sannella's enormous shoes as ETAPS Steering Committee chairman. Yes, it is now a year since I took over the role, and I would like my final words to transmit to Don all the gratitude and admiration that is felt by all of us who enjoy coming to ETAPS year after year, knowing that we will meet old friends, make new ones, plan new projects and be challenged by a new culture! Thank you Don!

January 2002
José Luiz Fiadeiro
Preface
Once again, the number, breadth and quality of papers submitted to the CC 2002 conference continue to be impressive. In spite of difficult times, which may have discouraged many potential authors from thinking of travelling to a conference, we still received 44 submissions. Of these, 21 came from 12 different European countries, 17 from the USA and Canada, and the remaining 6 from Australia and Asia. In addition to the regular paper submissions, we have an invited paper from Patrick and Radhia Cousot. It is especially fitting that Patrick Cousot should deliver the CC 2002 invited paper in Grenoble, because many years ago he wrote his PhD thesis at the University of Grenoble.

The members of the Program Committee took their refereeing task very seriously and decided very early on that a physical meeting was necessary to make the selection process as fair as possible. Accordingly, nine members of the Program Committee attended a meeting in Austin, Texas, on December 1, 2001, where the difficult decisions were made. Three others joined in the deliberations via a telephone conference call. Eventually, and after much (friendly) argument, 18 papers were selected for publication.

I wish to thank the Program Committee members for their selfless dedication and their excellent advice. I especially want to thank Kathryn McKinley and her assistant, Gem Naivar, for making the arrangements for the PC meeting. I also wish to thank my assistant, Catherine Emond, for preparing the materials for the PC meeting and for assembling the manuscript of the proceedings. The paper submissions and the reviewing process were supported by the START system (http://www.softconf.com). I thank the author of START, Rich Gerber, for making his software available to CC 2002 and for his prompt attention to the little problems that arose.

These conference proceedings include the invited paper of Patrick and Radhia Cousot, the 18 regular papers, and brief descriptions of three software tools.
January 2002
Nigel Horspool
Program Committee

Uwe Aßmann (Linköpings Universitet, Sweden)
David Bernstein (IBM Haifa, Israel)
Judith Bishop (University of Pretoria, South Africa)
Ras Bodik (University of Wisconsin-Madison, USA)
Cristina Cifuentes (Sun Microsystems, USA)
Christian Collberg (University of Arizona, USA)
Stefano Crespi-Reghizzi (Politecnico di Milano, Italy)
Michael Franz (University of California at Irvine, USA)
Nigel Horspool – Chair (University of Victoria, Canada)
Andreas Krall (Technical University of Vienna, Austria)
Reiner Leupers (University of Dortmund, Germany)
Kathryn McKinley (University of Texas at Austin, USA)
Todd Proebsting (Microsoft Research, USA)
Norman Ramsey (Harvard University, USA)
Additional Reviewers

G.P. Agosta, John Aycock, Jon Eddy, Anton Ertl, Marco Garatti, Görel Hedin, Won Kee Hong, Bruce Kapron, Moshe Klausner, Annie Liu, V. Martena, Bilha Mendelson, Sreekumar Nair, Dorit Naishloss, Ulrich Neumerkel, Mark Probst, Fermin Reig, P.L. San Pietro, Bernhard Scholz, Glenn Skinner, Phil Tomsich, David Ung, Mike Van Emmerik, JingLing Xue, Yaakov Yaari, Ayal Zaks
Table of Contents
Tool Demonstrations

LISA: An Interactive Environment for Programming Language Development . . . . . 1
    Marjan Mernik, Mitja Lenič, Enis Avdičaušević, and Viljem Žumer

Building an Interpreter with Vmgen . . . . . 5
    M. Anton Ertl and David Gregg

Compiler Construction Using LOTOS NT . . . . . 9
    Hubert Garavel, Frédéric Lang, and Radu Mateescu
Analysis and Optimization

Data Compression Transformations for Dynamically Allocated Data Structures . . . . . 14
    Youtao Zhang and Rajiv Gupta

Evaluating a Demand Driven Technique for Call Graph Construction . . . . . 29
    Gagan Agrawal, Jinqian Li, and Qi Su

A Graph-Free Approach to Data-Flow Analysis . . . . . 46
    Markus Mohnen

A Representation for Bit Section Based Analysis and Optimization . . . . . 62
    Rajiv Gupta, Eduard Mehofer, and Youtao Zhang
Low-Level Analysis

Online Subpath Profiling . . . . . 78
    David Oren, Yossi Matias, and Mooly Sagiv

Precise Exception Semantics in Dynamic Compilation . . . . . 95
    Michael Gschwind and Erik Altman

Decompiling Java Bytecode: Problems, Traps and Pitfalls . . . . . 111
    Jerome Miecznikowski and Laurie Hendren
Grammars and Parsing

Forwarding in Attribute Grammars for Modular Language Design . . . . . 128
    Eric Van Wyk, Oege de Moor, Kevin Backhouse, and Paul Kwiatkowski
Disambiguation Filters for Scannerless Generalized LR Parsers . . . . . 143
    Mark G. J. van den Brand, Jeroen Scheerder, Jurgen J. Vinju, and Eelco Visser
Invited Talk

Modular Static Program Analysis . . . . . 159
    Patrick Cousot and Radhia Cousot
Domain-Specific Languages and Tools

StreamIt: A Language for Streaming Applications . . . . . 179
    William Thies, Michal Karczmarek, and Saman Amarasinghe

Compiling Mercury to High-Level C Code . . . . . 197
    Fergus Henderson and Zoltan Somogyi

CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs . . . . . 213
    George C. Necula, Scott McPeak, Shree P. Rahul, and Westley Weimer
Energy Consumption Optimizations

Linear Scan Register Allocation in the Context of SSA Form and Register Constraints . . . . . 229
    Hanspeter Mössenböck and Michael Pfeiffer

Global Variable Promotion: Using Registers to Reduce Cache Power Dissipation . . . . . 247
    Andrea G. M. Cilio and Henk Corporaal

Optimizing Static Power Dissipation by Functional Units in Superscalar Processors . . . . . 261
    Siddharth Rele, Santosh Pande, Soner Onder, and Rajiv Gupta

Influence of Loop Optimizations on Energy Consumption of Multi-bank Memory Systems . . . . . 276
    Mahmut Kandemir, Ibrahim Kolcu, and Ismail Kadayif
Loop and Array Optimizations

Effective Enhancement of Loop Versioning in Java . . . . . 293
    Vitaly V. Mikheev, Stanislav A. Fedoseev, Vladimir V. Sukharev, and Nikita V. Lipsky
Value-Profile Guided Stride Prefetching for Irregular Code . . . . . 307
    Youfeng Wu, Mauricio Serrano, Rakesh Krishnaiyer, Wei Li, and Jesse Fang

A Comprehensive Approach to Array Bounds Check Elimination for Java . . . . . 325
    Feng Qian, Laurie Hendren, and Clark Verbrugge

Author Index . . . . . 343
LISA: An Interactive Environment for Programming Language Development

Marjan Mernik, Mitja Lenič, Enis Avdičaušević, and Viljem Žumer

University of Maribor, Faculty of Electrical Engineering and Computer Science, Institute of Computer Science, Smetanova 17, 2000 Maribor, Slovenia
Abstract. The LISA system is an interactive environment for programming language development. From the formal language specification of a particular programming language, LISA produces a language-specific environment that includes editors (a language-knowledgeable editor and a structured editor), a compiler/interpreter, and other graphic tools. LISA is a set of related tools, such as scanner generators, parser generators, compiler generators, graphic tools, editors and conversion tools, which are integrated through well-designed interfaces.
1 Introduction
We previously developed the compiler/interpreter generator tool LISA ver. 1.0, which automatically produces a compiler or an interpreter from ordinary attribute grammar specifications [2] [8]. In that version of the tool, however, incremental language development was not supported, so the language designer had to design new languages from scratch or by scavenging old specifications. Other deficiencies of ordinary attribute grammars become apparent in specifications for real programming languages: such specifications are large and unstructured, and are hard to understand, modify and maintain. The goal of the new version of the compiler/interpreter tool LISA was to remove these deficiencies of ordinary attribute grammars. We overcome the drawbacks of ordinary attribute grammars with concepts from object-oriented programming, i.e., templates and multiple inheritance [4]. With attribute grammar templates we are able to describe semantic rules which are independent of grammar production rules. With multiple attribute grammar inheritance we are able to organize specifications in such a way that they can be inherited and specialized from ancestor specifications. The proposed approach was successfully implemented in the compiler/interpreter generator LISA ver. 2.0 [5].
2 Architecture of the Tool LISA 2.0
LISA (Fig. 1) consists of several tools: editors, scanner generators, parser generators, compiler generators, graphic tools, and conversion tools such as fsa2rex. The architecture of the LISA system is modular. Integration is achieved
with strictly defined interfaces that describe the behavior and type of integration of the modules. Each module can register actions when it is loaded into the core environment. Actions are methods accessible from the environment; they can be executed via class reflection. Their existence is not verified until invocation, so actions are dynamically linked with module methods. A module can be integrated into the environment as a visual or core module. Visual modules are used for the graphical user interface and visual representation of data structures. Core modules are non-visual components, such as the LISA language compiler. This approach is based on class reflection and is similar to JavaBeans technology. With class reflection (the java.lang.reflect.* package) we can dynamically obtain the set of public methods and public variables of a module, so we can dynamically link module methods with actions. When an action is executed, the proper method is located and invoked with the description of the action event. With this architecture it is also possible to upgrade the system with different types of scanners, parsers and evaluators, which are presented as modules. This is achieved with a strict definition of communication data structures. Moreover, the modules for scanners, parsers and evaluators use templates for code generation, which can be easily changed and improved.
Fig. 1. LISA Integrated Development Environment
Editors are also generated from the formal language definition. The language-knowledgeable editor is a compromise between text editors and structure editors, since it just colors the different parts of a program (comments, operators, reserved words, etc.) to enhance the understandability and readability of programs. The generated lexical, syntax and semantic analysers, also written in Java, can be compiled in the integrated environment without issuing a command to javac (the Java compiler). Programs written in the newly defined language can be executed and evaluated. Users of the generated compiler/interpreter can visually observe the work of the lexical, syntax and semantic analyzers by watching the animation of the finite state automata and of the parse and semantic trees. The animation shows the program in action, and the graphical representations of the finite state automata and of the syntax and semantic trees are automatically updated as the program executes. Animated visualizations help explain the inner workings of programs and are a useful tool for debugging. These features make LISA very appropriate for programming language development. LISA is freely available to educational institutions from http://marcel.uni-mb.si/lisa. It runs on different platforms and requires Java 2 SDK (Software Development Kit & Runtime), version 1.2.2 or higher.
3 Applications of LISA
We have incrementally developed various small programming languages, such as PLM [3]. An application domain for which LISA is very suitable is the development of domain-specific languages. In our opinion, the development of domain-specific languages should exploit the advantages of the formal definition of general-purpose languages, while taking into consideration the special nature of domain-specific languages. An appropriate methodology that accommodates frequent changes of domain-specific languages is needed, since the language development process should be supported by modularity and abstraction in a manner that allows incremental changes as easily as possible. If incremental language development [7] is not supported, the language designer has to design languages from scratch or by scavenging old specifications. This approach was successfully used in the design and implementation of various domain-specific languages. In [6], the design and implementation of the Simple Object Description Language SODL for automatic interface creation is presented. The application domain was network applications. Since cross-network method calls slow down the performance of applications, the solution was Tier to Tier Object Transport (TTOT); however, this approach increased network application development time. To enhance productivity, the new domain-specific language SODL was designed. In [1], the design and implementation of the COOL and AspectCOOL languages using the LISA system is described. Here the application domain was aspect-oriented programming (AOP). AOP is a programming technique for modularizing concerns that crosscut the basic functionality of programs. In AOP, aspect languages are used to describe properties which crosscut basic functionality in a clean and modular way. AspectCOOL is an extension of the class-based object-oriented language COOL (Classroom Object-Oriented Language), which was designed and implemented simultaneously with AspectCOOL. Both languages were formally specified with multiple attribute grammar inheritance, which enabled us to gradually extend the languages with new features and to reuse previously defined specifications. Our experience with these non-trivial examples shows that multiple attribute grammar inheritance is useful in managing the complexity, reusability and extensibility of attribute grammars. Huge specifications become much shorter and are easier to read and maintain.
4 Conclusion
Many applications today are written in well-understood domains. One trend in programming is to provide software development tools designed specifically to handle such applications and thus to greatly simplify their development. These tools take a high-level description of the specific task and generate a complete application. One such well-established domain is compiler construction: there is a long tradition of producing compilers, the underlying theories are well understood, and many application generators exist which automatically produce compilers or interpreters from programming language specifications. In this paper, the compiler/interpreter generator LISA 2.0 has been briefly presented.
References

1. Enis Avdičaušević, Mitja Lenič, Marjan Mernik, and Viljem Žumer. AspectCOOL: An experiment in design and implementation of aspect-oriented language. Accepted for publication in ACM SIGPLAN Notices.
2. Marjan Mernik, Nikolaj Korbar, and Viljem Žumer. LISA: A tool for automatic language implementation. ACM SIGPLAN Notices, 30(4):71–79, April 1995.
3. Marjan Mernik, Mitja Lenič, Enis Avdičaušević, and Viljem Žumer. A reusable object-oriented approach to formal specifications of programming languages. L'Objet, 4(3):273–306, 1998.
4. Marjan Mernik, Mitja Lenič, Enis Avdičaušević, and Viljem Žumer. Multiple Attribute Grammar Inheritance. Informatica, 24(3):319–328, September 2000.
5. Marjan Mernik, Mitja Lenič, Enis Avdičaušević, and Viljem Žumer. Compiler/interpreter generator system LISA. In IEEE CD ROM Proceedings of 33rd Hawaii International Conference on System Sciences, 2000.
6. Marjan Mernik, Uroš Novak, Enis Avdičaušević, Mitja Lenič, and Viljem Žumer. Design and implementation of simple object description language. In Proceedings of 16th ACM Symposium on Applied Computing, pages 203–210, 2001.
7. Marjan Mernik and Viljem Žumer. Incremental language design. IEE Proceedings Software, 145(2-3):85–91, 1998.
8. Viljem Žumer, Nikolaj Korbar, and Marjan Mernik. Automatic implementation of programming languages using object-oriented approach. Journal of Systems Architecture, 43(1-5):203–210, 1997.
Building an Interpreter with Vmgen

M. Anton Ertl and David Gregg

Institut für Computersprachen, Technische Universität Wien, Argentinierstraße 8, A-1040 Wien, Austria
[email protected]
Trinity College, Dublin
Abstract. Vmgen automates many of the tasks of writing the virtual machine part of an interpreter, resulting in less coding, debugging and maintenance effort. This paper gives some quantitative data about the source code and generated code for a vmgen-based interpreter, and gives some examples demonstrating the simplicity of using vmgen.
1 Introduction
Interpreters are a popular approach for implementing programming languages, because only interpreters offer all of the following benefits: ease of implementation, portability, and a fast edit-compile-run cycle. The interpreter generator vmgen (available at http://www.complang.tuwien.ac.at/anton/vmgen/) automates many of the tasks in writing the virtual machine (VM) part of an interpretive system; it takes a simple VM instruction description file and generates code for: executing and tracing VM instructions, generating VM code, disassembling VM code, combining VM instructions into superinstructions, and profiling VM instruction sequences to find superinstructions. Vmgen has special support for stack-based VMs, but most of its features are also useful for register-based VMs. Vmgen supports a number of high-performance techniques and optimizations. The resulting interpreters tend to be faster than other interpreters for the same language. This paper presents an example of vmgen usage. A detailed discussion of the inner workings of vmgen and performance data can be found elsewhere [1].
2 Example Overview
The running example in this paper is the example provided with the vmgen package: an interpretive system for a tiny Modula-2-style language that uses a JVM-style virtual machine. The language supports integer variables and expressions, assignments, if- and while-structures, function definitions and calls. Our example interpreter consists of two conceptual parts: the front-end parses the source code and generates VM code; the VM interpreter executes the VM code.
Name                 Lines  Description
Makefile                67
mini-inst.vmg          139  VM instruction descriptions
mini.h                  72  common declarations
mini.l                  42  front-end scanner
mini.y                 139  front-end (parser, VM code generator)
support.c              220  symbol tables, main()
peephole-blacklist       3  VM instructions that must not be combined
disasm.c                36  template: VM disassembler
engine.c               186  template: VM interpreter
peephole.c             101  template: combining VM instructions
profile.c              160  template: VM instruction sequence profiling
stat.awk                13  template: aggregate profile information
seq2rule.awk             8  template: define superinstructions

Template files total:  504
Specific files total:  682
Total:                1186
Fig. 1. Source files in the example interpreter
Figure 1 shows quantitative data on the source code of our example. Note that the numbers include comments, which are sometimes relatively extensive (in particular, more than half of the lines in mini-inst.vmg are comments or empty). Some of the files are marked as templates; in a typical vmgen application they will be copied from the example and used with few changes, so these files cost very little. The other files contain code that will typically be written specifically for each application. Among the specific files, mini-inst.vmg contains all of the VM description; in addition, there are VM-related declarations in mini.h, calls to VM code generation functions in mini.y, and calls to the VM interpreter, disassembler, and profiler in support.c. Vmgen generates 936 lines in six files from mini-inst.vmg (see Fig. 2). The expansion factor from the source file indicates that vmgen saves a lot of work in coding, maintaining and debugging the VM interpreter. In addition to the reduced line count there is another reason why vmgen reduces the number of bugs: a new VM instruction just needs to be inserted in one place in mini-inst.vmg (and code for generating it should be added to the front end), whereas in a manually coded VM interpreter a new instruction needs code in several places. The various generated files correspond mostly directly to template files, with the template files containing wrapper code that works for all VMs, and the generated files containing code or tables specific to the VM at hand.
Name             Lines  Description
mini-disasm.i      103  VM disassembler
mini-gen.i          84  VM code generation
mini-labels.i       19  VM instruction codes
mini-peephole.i      0  VM instruction combining
mini-profile.i      95  VM instruction sequence profiling
mini-vm.i          635  VM instruction execution

Total:             936
Fig. 2. Vmgen-generated files in the example interpreter
3 Simple VM Instructions
A typical vmgen instruction specification looks like this:

    sub ( i1 i2 -- i )
    i = i1-i2;

The first line gives the name of the VM instruction (sub) and its stack effect: it takes two integers (i1 and i2) from the stack and pushes one integer (i) on the stack. The next line contains C code that accesses the stack items as variables. Loading i1 and i2 from and storing i to the stack, as well as instruction dispatch, are managed automatically by vmgen. Another example:

    lit ( #i -- i )

The lit instruction takes the immediate argument i from the instruction stream (indicated by the # prefix) and pushes it on the stack. No user-supplied C code is necessary for lit.
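To give a feel for the bookkeeping vmgen takes care of, here is a minimal, self-contained C sketch of the same stack discipline. This is a hand-written analogue for illustration only; the opcode names, dispatch loop, and stack layout are our assumptions, not actual vmgen output:

    #include <stdio.h>

    enum { OP_LIT, OP_SUB, OP_END };

    int main(void) {
        /* VM code for "10 - 4": lit 10, lit 4, sub */
        int code[] = { OP_LIT, 10, OP_LIT, 4, OP_SUB, OP_END };
        int stack[16];
        int *ip = code;
        int *sp = stack + 16;        /* stack grows downward; sp[0] is the top */

        for (;;) {
            switch (*ip++) {
            case OP_LIT:             /* ( #i -- i ): push immediate operand */
                *--sp = *ip++;
                break;
            case OP_SUB: {           /* ( i1 i2 -- i ) */
                int i2 = sp[0], i1 = sp[1];
                sp += 1;             /* two items popped, one pushed */
                sp[0] = i1 - i2;     /* the user-supplied C code */
                break;
            }
            case OP_END:
                printf("%d\n", sp[0]);   /* prints 6 */
                return 0;
            }
        }
    }

Vmgen's real output differs in detail (it also supports threaded-code dispatch, tracing, and superinstructions), but the load/store and dispatch code shown here is exactly the kind of boilerplate the generator writes for you.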
4 VM Code Generation
These VM instructions are generated by the following rules in mini.y:

    expr: term '-' term { gen_sub(&vmcodep); }
    term: NUM           { gen_lit(&vmcodep, $1); }

The code generation functions gen_sub and gen_lit are generated automatically by vmgen; gen_lit has a second argument that specifies the immediate argument of lit (in this example, the number being compiled by the front end). Parsing and generating code for all subexpressions, then generating the code for the expression, naturally leads to postfix code for a stack machine. This is one of the reasons why stack-based VMs are very popular in interpreters. The programmer just has to ensure that all rules for term and expr produce code that leaves exactly one value on the stack.
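The generated functions essentially append opcodes and immediate arguments to the VM code area. A plausible hand-written equivalent, reusing the opcode constants from the interpreter sketch above (the Inst type and helper names are assumptions for illustration):

    enum { OP_LIT, OP_SUB };     /* as in the interpreter sketch above */
    typedef int Inst;

    static void gen_inst(Inst **vmcodep, Inst op) {
        *(*vmcodep)++ = op;      /* append opcode, advance code pointer */
    }

    void gen_sub(Inst **vmcodep) {
        gen_inst(vmcodep, OP_SUB);
    }

    void gen_lit(Inst **vmcodep, int i) {
        gen_inst(vmcodep, OP_LIT);
        *(*vmcodep)++ = i;       /* append the immediate argument */
    }

Compiling the expression 10 - 4 through the yacc rules above would thus emit the sequence OP_LIT, 10, OP_LIT, 4, OP_SUB. The real generated functions do more; in particular, they cooperate with the peephole optimizer so that adjacent instructions can be combined into superinstructions (Sect. 5).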
The power of yacc and its actions is sufficient for our example, but for implementing a more complex language the user will probably choose a more sophisticated tool or build a tree and manually code tree traversals. In both cases, generating code in a post-order traversal of the expression parse tree is easy.
5 Superinstructions
In addition to simple instructions, you can define superinstructions as a combination of a sequence of simple instructions:

    lit_sub = lit sub

This defines a new VM instruction lit_sub that behaves in the same way as the sequence lit sub, but is faster. After adding this instruction to mini-inst.vmg and rebuilding the interpreter, this superinstruction is generated automatically whenever a call to gen_lit is followed by a call to gen_sub. But you need not even define the superinstructions yourself; you can generate them automatically from a profile of executed VM instruction sequences: compile the VM interpreter with profiling enabled, and run some programs representing your workload. The resulting profile lists the number of dynamic executions for each static occurrence of a sequence, e.g.,

    18454929  lit sub
     9227464  ... lit sub
This indicates that the sequence lit sub occurred in two places, for a total of 27682393 dynamic executions. These data can be aggregated with the stat.awk script; then the user can choose the most promising superinstructions (typically with another small awk or perl script), and finally transform the selected sequences into the superinstruction rule syntax with seq2rule.awk. The original intent of the superinstruction feature was to improve the run-time performance of the interpreter (and it achieves this goal), but we also noticed that it makes interpreter construction easier: in some places in an interpretive system, we can either generate a sequence of existing instructions, or define a new instruction and generate that. In a manually written interpreter, the latter approach yields a faster interpreter but requires more work. Using vmgen, you can just take the first approach and let the sequence be optimized into a superinstruction if it occurs frequently; in this way, you get the best of both approaches: little effort and good run-time performance.
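In interpreter terms, a superinstruction simply fuses the bodies of its components and saves the intermediate dispatch and stack traffic. In the style of the hypothetical dispatch loop sketched in Sect. 3, the fused case might look as follows (again an illustrative fragment, not vmgen output):

    case OP_LIT_SUB: {        /* fused "lit sub": subtract an immediate */
        int i = *ip++;        /* immediate argument of the lit part */
        sp[0] = sp[0] - i;    /* no intermediate push/pop, no extra dispatch */
        break;
    }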
References

1. M. Anton Ertl, David Gregg, Andreas Krall, and Bernd Paysan. vmgen — a generator of efficient virtual machine interpreters. Software—Practice and Experience, 2002. Accepted for publication.
Compiler Construction Using LOTOS NT

Hubert Garavel, Frédéric Lang, and Radu Mateescu

Inria Rhône-Alpes – Vasy, 655, avenue de l'Europe, 38330 Montbonnot, France
{Hubert.Garavel,Frederic.Lang,Radu.Mateescu}@inria.fr
1 Introduction
Much academic and industrial effort has been invested in compiler construction. Numerous tools and environments have been developed to improve compiler quality while reducing implementation and maintenance costs (an extensive catalog can be found at http://catalog.compilertools.net). In the domain of computer-aided verification, most tools involve compilation and/or translation steps. This is the case with the tools developed by the Vasy team of Inria Rhône-Alpes, for instance the Cadp [5] tools (http://www.inrialpes.fr/vasy/cadp) for analysis of protocols and distributed systems. As regards lexical and syntax analysis, all Cadp tools are built using Syntax [3], a compiler generator that offers advanced error recovery features. As regards the description, construction, and traversal of abstract syntax trees (Asts), three approaches have been used successively:

– In the Caesar [8] compiler for Lotos [10], Asts are programmed in C. This low-level approach leads to slow development, as one has to deal explicitly with pointers and space management to encode and explore Asts.
– In the Caesar.Adt [6] and Xtl [13] compilers, Asts are described and handled using Lotos abstract data types, which are then translated into C using the Caesar.Adt compiler itself (bootstrap); yet, for convenience and efficiency, certain imperative processings are directly programmed in C. This approach reduces the drawbacks of using C exclusively, but suffers from limitations inherent to the algebraic specification style (lack of local variables, of sequential composition, etc.).
– For the Traian and Svl 1.0 compilers, and for the Evaluator 3.0 [14] model-checker, the Fnc-2 [12] compiler generator (http://www.inrialpes.fr/vasy/fnc2), based on attribute grammars, was used. Fnc-2 lets one declare attribute calculations for each Ast node and evaluates the attributes automatically, according to their dependencies. Although we have been able to suggest many improvements incorporated into Fnc-2, it turned out that, for input languages with large grammars, Fnc-2 has practical limitations: development and debugging are complex, and the generated compilers have large object files and exhibit average performance (slow compilation, large memory footprint due to the creation of multiple Asts and the absence of garbage collection).

Therefore, the Vasy team switched to a new technology in order to develop its most recent verification tools.
2 Using LOTOS NT for Compiler Construction
E-Lotos (Enhanced Lotos) [11] is a new Iso standard for the specification of protocols and distributed systems. Lotos NT [9,16] is a simplified variant of E-Lotos targeting efficient implementation. It combines the strong theoretical foundations of process algebras with language features suitable for wide industrial use. The data part of Lotos NT significantly improves over the previous Lotos standard [10]: equational programming is replaced with a language similar to first-order Ml extended with imperative features (assignments, loops, etc.). A compiler for Lotos NT, named Traian (http://www.inrialpes.fr/vasy/traian), translates the data part of Lotos NT specifications into C. Used in conjunction with a parser generator such as Lex/Yacc or Syntax, Traian is suitable for compiler construction:

– Lotos NT allows a straightforward description of Asts: each non-terminal symbol of the grammar is encoded by a data type having a constructor for each grammar rule associated to the symbol. Traversals of Asts for computing attributes are defined by recursive functions using "case" statements and pattern-matching.
– Traian automatically generates "printer" functions for each Lotos NT data type, which makes it possible to inspect Asts and facilitates the debugging of semantic passes.
– Traian also allows a Lotos NT specification to include external data types and functions implemented in C, enabling easy interfacing of Lotos NT specifications with hand-written C modules as well as C code generated by Lex/Yacc or Syntax.
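As a rough analogue of this encoding in C terms (purely illustrative; this is neither LOTOS NT syntax nor Traian output), each non-terminal becomes a tagged union with one constructor per grammar rule, and an attribute computation becomes a recursive function that matches on the tag:

    #include <stdio.h>
    #include <stdlib.h>

    /* AST for a toy expression grammar: one constructor per rule. */
    typedef enum { EXPR_NUM, EXPR_ADD } ExprTag;

    typedef struct Expr {
        ExprTag tag;
        union {
            int num;                             /* Num(n)    */
            struct { struct Expr *l, *r; } add;  /* Add(l, r) */
        } u;
    } Expr;

    static Expr *mk_num(int n) {
        Expr *e = malloc(sizeof *e);
        e->tag = EXPR_NUM; e->u.num = n;
        return e;
    }

    static Expr *mk_add(Expr *l, Expr *r) {
        Expr *e = malloc(sizeof *e);
        e->tag = EXPR_ADD; e->u.add.l = l; e->u.add.r = r;
        return e;
    }

    /* An attribute computation: a traversal switching on the constructor. */
    static int eval(const Expr *e) {
        switch (e->tag) {
        case EXPR_NUM: return e->u.num;
        case EXPR_ADD: return eval(e->u.add.l) + eval(e->u.add.r);
        }
        return 0;
    }

    int main(void) {
        printf("%d\n", eval(mk_add(mk_num(2), mk_num(3))));  /* prints 5 */
        return 0;
    }

In Lotos NT, the constructors, the pattern-matching "case" statements, and the printer functions come essentially for free from the data type declaration, which is what makes this style cheaper there than in plain C.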
3 Applications
Since 1999, Lotos NT has been used to develop three significant compilers. For each compiler, the lexer and parser are built using Syntax and the Asts using Lotos NT. Type-checking, program transformation, and code generation are also implemented in Lotos NT. Some hand-written C code is added, either for routine tasks (e.g., parsing options) or for some specialized algorithms (e.g., model-checking):

– The Svl 2.0 [7] compiler transforms high-level verification scripts into Bourne shell scripts (see Fig. 1).
– The Evaluator 4.0 model-checker transforms a temporal logic formula into a boolean equation system solver written in C; the solver is then compiled and executed, taking as input a labelled transition system and producing a diagnostic (see Fig. 2).
– The Ntif tool suite deals with a high-level language for symbolic transition systems; it includes a front-end, the Nt2if back-end generating a lower-level format, and the Nt2dot back-end producing a graph format visualizable with AT&T's GraphViz package.
Fig. 1. Architecture of the Svl 2.0 compiler. (Pipeline: an SVL program goes through syntax analysis and AST construction (SYNTAX), type checking (LOTOS NT), expansion of meta-operations (LOTOS NT), and code generation (LOTOS NT), producing a Bourne shell script that a shell interpreter runs on the input files to produce the output files; syntax and type errors are reported by the first two phases.)
Fig. 2. Architecture of the Evaluator 4.0 model-checker. (A temporal logic formula goes through syntax analysis and AST construction (SYNTAX), type checking (LOTOS NT), and translation to boolean equation systems (LOTOS NT); the resulting BES solver in C is compiled by a C compiler and, taking a labelled transition system as input, produces a diagnostic file.)

The table below summarizes the size (in lines of code) of each compiler.

                Syntax  Lotos NT      C  Shell   Total  Generated C
Svl 2.0          1,250     2,940    370  2,170   6,730       12,400
Evaluator 4.0    3,600     7,500  3,900      —  15,000       37,000
Ntif             1,620     3,620  1,200      —   6,440       20,644
4 Related Work and Conclusions
Alternative approaches exist based upon declarative representations, such as attribute grammars (Fnc-2 [12], SmartTools [1]), logic programming (Ale [4], Centaur [2]), or term rewriting (Txl, http://www.thetxlcompany.com; Kimwitu [18]; Asf+Sdf [17]). In these approaches, Asts are implicit (not directly visible to the programmer) and it is not necessary to specify the order of attribute evaluation, which is inferred from the dependencies. On the contrary, our approach requires explicit Ast specification and attribute computation ordering. Practically, this is not too restrictive, since the user is usually aware of these details.

Lotos NT is a hybrid between imperative and functional languages. Unlike the object-oriented approach (e.g., JavaCC, http://www.webgain.com/products/java_cc), in which Asts are defined using classes and visitors are implemented using methods, the Lotos NT code for computing a given attribute does not need to be split into several classes, but can be clearly centralized in a single function containing a "case" statement. Compared to lower-level imperative languages such as C, Lotos NT avoids tedious and error-prone explicit pointer manipulation. Compared to functional languages such as Haskell or Caml (http://caml.inria.fr), for which the Happy (http://www.haskell.org/happy) and CamlYacc parser generators are available, Lotos NT allows neither higher-order functions nor polymorphism. In practice, we believe that these missing features are not essential for compiler construction; instead, Lotos NT provides useful mechanisms such as strong typing, function overloading, pattern-matching, and sequential composition. Lotos NT external C types and functions make input/output operations simpler than in Haskell/Happy, where one must be acquainted with the notion of monads. Contrary to functional languages specifically dedicated to compiler construction, such as Puma (part of the Cocktail toolbox, http://www.first.gmd.de/cocktail) and Gentle [15], Lotos NT is a general-purpose language, applicable to a wider range of problems. The Lotos NT technology can be compared with other hybrid approaches such as the App (http://www.primenet.com/~georgen/app.html) and Memphis (http://memphis.compilertools.net) preprocessors, which extend C/C++ with abstract data types and pattern-matching. Yet, these preprocessors lack the static analysis checks supported by Lotos NT and Traian (strong typing, detection of uninitialized variables, exhaustiveness of "case" statements, etc.), which significantly facilitate the programming activity.

Our experience in using Lotos NT to develop three compilers demonstrates the efficiency and robustness of this pragmatic approach. Since 1998, the Traian compiler has been available on several platforms (Windows, Linux, Solaris) and can be downloaded from the Internet. The three Traian-based compilers are or will soon be available: Svl 2.0 is distributed within Cadp 2001 "Ottawa"; Evaluator 4.0 and Ntif will be released in future versions of Cadp. Ntif is already used in a test generation platform for smart cards in an industrial project with Schlumberger.
References

1. I. Attali, C. Courbis, P. Degenne, A. Fau, D. Parigot, and C. Pasquier. SmartTools: A Generator of Interactive Environments Tools. In Proc. of CC '2001, volume 2027 of LNCS, 2001.
2. P. Borras, D. Clément, Th. Despeyroux, J. Incerpi, G. Kahn, B. Lang, and V. Pascual. Centaur: the system. In Proc. of SIGSOFT'88, 3rd Symposium on Software Development Environments (SDE3), 1988.
3. P. Boullier and P. Deschamp. Le système SYNTAX : Manuel d'utilisation et de mise en œuvre sous Unix. http://www-rocq.inria.fr/oscar/www/syntax, 1997.
4. B. Carpenter. The Logic of Typed Feature Structures. Cambridge Tracts in Theoretical Computer Science, 32, 1992.
5. J.-C. Fernandez, H. Garavel, A. Kerbrat, R. Mateescu, L. Mounier, and M. Sighireanu. CADP (CÆSAR/ALDEBARAN Development Package): A Protocol Validation and Verification Toolbox. In Proc. of CAV '96, volume 1102 of LNCS, 1996.
6. H. Garavel. Compilation of LOTOS Abstract Data Types. In Proc. of FORTE'89. North-Holland, 1989.
7. H. Garavel and F. Lang. SVL: A Scripting Language for Compositional Verification. In Proc. of FORTE'2001. Kluwer, 2001. INRIA Research Report RR-4223.
8. H. Garavel and J. Sifakis. Compilation and Verification of LOTOS Specifications. In Proc. of PSTV'90. North-Holland, 1990.
9. H. Garavel and M. Sighireanu. Towards a Second Generation of Formal Description Techniques – Rationale for the Design of E-LOTOS. In Proc. of FMICS'98, Amsterdam, 1998. CWI. Invited lecture.
10. ISO/IEC. LOTOS — A Formal Description Technique Based on the Temporal Ordering of Observational Behaviour. International Standard 8807, 1988.
11. ISO/IEC. Enhancements to LOTOS (E-LOTOS). International Standard 15437:2001, 2001.
12. M. Jourdan, D. Parigot, C. Julié, O. Durin, and C. Le Bellec. Design, Implementation and Evaluation of the FNC-2 Attribute Grammar System. ACM SIGPLAN Notices, 25(6), 1990.
13. R. Mateescu and H. Garavel. XTL: A Meta-Language and Tool for Temporal Logic Model-Checking. In Proc. of STTT '98. BRICS, 1998.
14. R. Mateescu and M. Sighireanu. Efficient On-the-Fly Model-Checking for Regular Alternation-Free Mu-Calculus. In Proc. of FMICS'2000, 2000. INRIA Research Report RR-3899. To appear in Science of Computer Programming.
15. F. W. Schröer. The GENTLE Compiler Construction System. R. Oldenbourg Verlag, 1997.
16. M. Sighireanu. LOTOS NT User's Manual (Version 2.1). INRIA projet VASY. ftp://ftp.inrialpes.fr/pub/vasy/traian/manual.ps.Z, November 2000.
17. M. G. J. van den Brand, A. van Deursen, J. Heering, H. A. de Jong, M. de Jonge, T. Kuipers, P. Klint, L. Moonen, P. A. Olivier, J. Scheerder, J. J. Vinju, E. Visser, and J. Visser. The ASF+SDF Meta-Environment: A Component-Based Language Development Environment. In Proc. of CC '2001, volume 2027 of LNCS, 2001.
18. P. van Eijk, A. Belinfante, H. Eertink, and H. Alblas. The Term Processor Generator Kimwitu. In Proc. of TACAS '97, 1997.
Data Compression Transformations for Dynamically Allocated Data Structures

Youtao Zhang and Rajiv Gupta

Dept. of Computer Science, The University of Arizona, Tucson, Arizona 85721
Abstract. We introduce a class of transformations which modify the representation of dynamic data structures used in programs with the objective of compressing their sizes. We have developed the common-prefix and narrow-data transformations that respectively compress a 32 bit address pointer and a 32 bit integer field into 15 bit entities. A pair of fields which have been compressed by the above transformations are packed together into a single 32 bit word. The transformations are designed to apply to data structures that are partially compressible; that is, they compress the portions of data structures to which the transformations apply and provide a mechanism to handle the data that is not compressible. Accesses to compressed data are efficiently implemented by designing data compression extensions (DCX) to the processor's instruction set. We have observed average reductions in heap allocated storage of 25% and average reductions in execution time and power consumption of 30%. If DCX support is not provided, the reductions in execution times fall from 30% to 12.5%.
1 Introduction
With the proliferation of limited-memory computing devices, optimizations that reduce memory requirements are increasing in importance. We introduce a class of transformations which modify the representation of dynamically allocated data structures used in pointer-intensive programs with the objective of compressing their sizes. The fields of a node in a dynamic data structure typically consist of both pointer and non-pointer data. Therefore we have developed the common-prefix and narrow-data transformations that respectively compress a 32 bit address pointer and a 32 bit integer field into 15 bit entities. A pair of fields which have been compressed can be packed into a single 32 bit word. As a consequence of compression, the memory footprint of the data structures is significantly reduced, leading to significant savings in heap allocated storage, which is quite important for memory-intensive applications. The reduction in memory footprint can also lead to significantly reduced execution times due to a reduction in data cache misses in the transformed program.
Supported by DARPA PAC/C Award F29601-00-1-0183 and NSF grants CCR-0105355, CCR-0096122, EIA-9806525, and EIA-0080123 to the Univ. of Arizona.
An important feature of our transformations is that they have been designed to apply to data structures that are partially compressible. In other words, they compress the portions of data structures to which the transformations apply and provide a mechanism to handle the data that is not compressible. Initially, data storage for a compressed data structure is allocated assuming that it is fully compressible. However, at runtime, when uncompressible data is encountered, additional storage is allocated to handle such data. Our experience with applications from the Olden test suite demonstrates that this is a highly important feature, because all the data structures that we examined in our experimentation were highly compressible, but none were fully compressible. For efficiently accessing data in compressed form we propose data compression extensions (DCX) to a RISC-style ISA, consisting of six simple instructions. These instructions perform two types of operations. First, since we must handle partially compressible data structures, whenever a field that has been compressed is updated, we must check whether the new value to be stored in that field is indeed compressible. Second, when we need to make use of a compressed value in a computation, we must perform an extract-and-expand operation to obtain the original 32 bit representation of the value. We have implemented our techniques and evaluated them. The DCX instructions have been incorporated into the MIPS-like instruction set used by the simplescalar simulator, and the compression transformations have been incorporated into the gcc compiler. We have also addressed other important implementation issues, including the selection of fields for compression and packing. Our experiments with six benchmarks from the Olden test suite demonstrate an average space savings of 25% in heap allocated storage and average reductions of 30% in execution times and power consumption. The net reduction in execution times is attributable to reduced miss rates for the L1 data cache and L2 unified cache and to the availability of DCX instructions.
2 Data Compression Transformations
As mentioned earlier, we have developed two compression transformations: one to handle pointer data and the other to handle narrow-width non-pointer data. We illustrate the transformations using an example of the dynamically allocated linked list data structure shown below – the next and value fields are compressed to illustrate the compression of both pointer and non-pointer data. The compressed fields are packed together to form a single 32 bit field value_next.
Transformed Structure: struct list node { · · ·; int value next; } *t;
Common-Prefix transformation for pointer data. The pointer contained in the next field of the link list can be compressed under certain conditions. In particular, consider the addresses corresponding to an instance of list node (addr1)
16
Youtao Zhang and Rajiv Gupta
and the next field in that node (addr2). If the two addresses share a common 17 bit prefix because they are located fairly close in memory, we classify the next pointer as compressible. In this case we eliminate the common prefix from address addr2 which is stored in the next pointer field. The lower order 15 bits from addr2 represent the representation of the pointer in compressed form. The 32 bit representation of a next field can be reconstructed when required by obtaining the prefix from the pointer to the list node instance to which the next field belongs. Narrow data transformation for non-pointer data. Now let us consider the compression of the narrow width integer value in the value field. If the 18 higher order bits of an array element are identical, that is, they are either all 0’s or all 1’s, it is classified as compressible. The 17 higher order bits are discarded and leaving a 15 bit entity. Since the 17 bits discarded are identical to the most significant order bit of the 15 bit entity, the 32 bit representation can be easily derived when needed by replicating the most significant bit. Packing together compressed fields. The value and next fields of a node belonging to an instance of list node can be packed together into a single 32 bit word as they are simply 15 bit entities in their compressed form. Together they are stored in value next field of the transformed structure. The 32 bits of value next are divided into two half words. Each compressed field is stored in the lower order 15 bits of the corresponding half word. According to the above strategy, bits 15 and 31 are not used by the compressed fields. Next we describe the handling of uncompressible data in partially compressible data structures. The implementation of partially compressible data structures require an additional bit for encoding information. This is why we compress fields down to 15 bit entities and not into 16 bit entities. Partial compressibility. Our basic approach is to allocate only enough storage to accommodate a compressed node when a new node in the data structure is created. Later, as the pointer fields are assigned values, we check to see if the fields are compressible. If they are, they can be accommodated in the allocated space; otherwise additional storage is allocated to hold the fields in uncompressed form. The previously allocated location is now used to hold a pointer to this additional storage. Therefore for accessing uncompressible fields we have to go through an extra step of indirection. If the uncompressible data stored in the fields is modified, it is possible that the fields may now become compressible. However, we do not carry out such checks and instead we leave the fields in such cases in uncompressed form. This is because exploitation of such compression opportunities can lead to repeated allocation and deallocation of extra locations if data values repeatedly keep oscillating between compressible and uncompressible kind. To avoid repeated allocation and deallocation of extra locations we simplify our approach so that once a field is assigned an uncompressible value, from then onwards, the data in the field is always maintained in uncompressed form.
Data Compression Transformations for Dynamically Allocated Data Structures
17
We use the most significant bit (bit 31) in the word to indicate whether or not the data stored in the word is compressed or not. This is possible because in the MIPS base system that we use, the most significant bit for all heap addresses is always 0. It contains a 0 to indicate that the word contains compressed values. If it contains a 1, it means that one or both of values were not compressible and instead the word contains a pointer to an extra pair of dynamically allocated locations which contain the values of the two fields in uncompressed form. While bit 31 is used to encode extra information, bit 15 is never used for any purpose. Original: Set "value" field and Create "next" link addr0
addr0
t
t value next
addr1
value ( = v1 ) next
nil
Transformed(case 1) : both "next" and "value" fields are compressible addr0
addr0
t
t 0
nv
(v1)
0
nv
(v1)
nil
addr11 Transformed(case 2) : "value" is compressible and "next" is not addr0
addr0
t
t 0
nv
(v1)
1
nv
nil
v1
addr11
v1
addr11
Transformed(case 3) : "value" is not compressible addr0
addr0
t
t 1
nv
1
v1
nv
nil
Fig. 1. Dealing with uncompressible data.
In Fig. 1 we illustrate the above method using an example in which an instance of list_node is allocated and then the value and next fields are set up one at a time. As we can see, storage is first allocated to accommodate the two fields in compressed form. As soon as the first uncompressible field is encountered, additional storage is allocated to hold the two fields in uncompressed form. Under this scheme there are three possibilities, which are illustrated in Fig. 1. In the first case both fields are found to be compressible, and therefore no extra locations are allocated. In the second case the value field, which is accessed first, is compressible but the next field is not. Thus, initially the value field is stored in compressed form, but later, when the next field is found not to be compressible, extra locations are allocated and both fields are stored in uncompressed form. Finally, in the third case the value field is not compressible, so extra locations are allocated right away and neither field is ever stored in compressed form.
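Every read of the packed word must therefore first test bit 31. A hypothetical C rendering of this read protocol, reusing the expansion helpers sketched earlier and assuming a 32-bit machine where heap addresses have bit 31 clear:

    #include <stdint.h>

    int32_t  int_expand(uint32_t field);                 /* sketched earlier */
    uint32_t ptr_expand(uint32_t node, uint32_t field);  /* sketched earlier */

    void read_value_next(uint32_t *node, uint32_t packed,
                         int32_t *value, uint32_t *next) {
        if ((packed & 0x80000000u) == 0) {   /* bit 31 clear: compressed pair */
            *value = int_expand(packed & 0x7FFFu);
            *next  = ptr_expand((uint32_t)(uintptr_t)node,
                                (packed >> 16) & 0x7FFFu);
        } else {                             /* bit 31 set: follow indirection */
            uint32_t *extra = (uint32_t *)(uintptr_t)(packed & 0x7FFFFFFFu);
            *value = (int32_t)extra[0];      /* uncompressed value */
            *next  = extra[1];               /* uncompressed next  */
        }
    }

A store takes the mirror-image path: check compressibility first (this is what the bneh17/bneh18 instructions below make cheap) and fall back to allocating the extra pair on the first uncompressible value.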
3 Instruction Set Support
Compression reduces the amount of heap allocated storage used by the program, which typically improves the data cache behavior. Also, if both fields need to be read in tandem, a single load is enough to read both. However, the manipulation of the compressed fields also creates additional overhead. To minimize this overhead we have designed new RISC-style instructions: three simple instructions each for pointer and non-pointer data that efficiently implement the common-prefix and narrow-data transformations. The semantics of these instructions are summarized in Fig. 2. They are RISC-style instructions with complexity comparable to existing branch and integer ALU instructions. Let us discuss these instructions in greater detail; C models of all six instructions are sketched after the instruction descriptions below.

Checking compressibility. Since we would like to handle partially compressible data, before we actually compress a data item at runtime, we must first check whether the data item is compressible. Therefore the first instruction type we introduce allows efficient checking of data compressibility. We provide the two new instructions described below: the first checks the compressibility of pointer data and the second does the same for non-pointer data.

bneh17 R1, R2, L1 – checks if the higher order 17 bits of R1 and R2 are the same. If they are, execution continues and the field held in R2 can be compressed; otherwise the branch is taken to a point where we handle the uncompressible address in R2 by allocating additional storage. The instruction also handles the case where R2 contains a nil pointer, which is represented by the value 0 in both compressed and uncompressed forms. Since 0 represents a nil pointer, the lower order 15 bits of an allocated address should never be all zeroes; to ensure this, we have modified our malloc routine so that it never allocates storage locations with such addresses.

bneh18 R1, L1 – checks if the higher order 18 bits of R1 are identical (i.e., all 0's or all 1's). If they are, execution continues and the value held in R1 is compressed; otherwise the value in R1 is not compressible and the branch is taken to a point where we place code to handle this situation by allocating additional storage.

Extract-and-expand. If a pointer is stored in compressed form, before it can be dereferenced we must first reconstruct its 32-bit representation. We do the same for compressed non-pointer data before its use. Therefore the second instruction type that we introduce carries out extract-and-expand operations. There are four new instructions, described below. The first two extract-and-expand compressed pointer fields from the lower and upper halves of a 32-bit word respectively; the next two do the same for non-pointer data.

xtrhl R1, R2, R3 – extracts the compressed pointer field stored in the lower order bits (0 through 14) of register R3 and appends it to the common-prefix contained in the higher order bits (15 through 31) of R2 to construct the uncompressed pointer, which is then made available in R1. We also handle the case when R3 contains a nil pointer: if the compressed field is a nil pointer, R1 is set to nil.
BNEH17 R1,R2,L1:  if (R2 != 0) && (R1[31..15] != R2[31..15]) goto L1

BNEH18 R1,L1:     if (R1[31..14] != 0) && (R1[31..14] != 0x3ffff) goto L1

XTRHL R1,R2,R3:   if (R3[14..0] != 0)  /* non-NULL case */
                      R1 = R2[31..15] . R3[14..0]
                  else R1 = 0

XTRHH R1,R2,R3:   if (R3[30..16] != 0) /* non-NULL case */
                      R1 = R2[31..15] . R3[30..16]
                  else R1 = 0

XTRL R1,R2:       if (R2[14] == 1) R1 = 0x1ffff . R2[14..0]  else R1 = R2[14..0]

XTRH R1,R2:       if (R2[30] == 1) R1 = 0x1ffff . R2[30..16] else R1 = R2[30..16]

Fig. 2. DCX instructions. (R[i..j] denotes bits i through j of register R, and "." denotes bit concatenation.)
xtrhh R1, R2, R3 – extracts the compressed pointer field stored in the higher order bits (16 through 30) of register R3 and appends it to the common-prefix contained in the higher order bits (15 through 31) of R2 to construct the uncompressed pointer, which is then made available in R1. If the compressed field is a nil pointer, R1 is set to nil.

The instructions xtrhl and xtrhh can also be used to compress two fields together. However, they are not essential for this purpose because typically there are existing instructions which can perform this operation; in the MIPS-like instruction set used in this work, this was indeed the case.

xtrl R1, R2 – extracts the field stored in the lower half of R2, expands it, and then stores the resulting 32 bit value in R1.

xtrh R1, R2 – extracts the field stored in the higher order bits of R2, expands it, and then stores the resulting 32 bit value in R1.
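To make these semantics precise, here is a C model of the six DCX instructions as we read them from Fig. 2; this is a sketch of the stated behavior, not of the hardware. For the two branches, returning true corresponds to the branch being taken (the operand is not compressible).

#include <stdbool.h>
#include <stdint.h>

/* bneh17 R1,R2,L1: taken if R2 is non-nil and the 17-bit prefixes differ */
static bool bneh17_taken(uint32_t r1, uint32_t r2) {
    return r2 != 0 && ((r1 ^ r2) >> 15) != 0;
}

/* bneh18 R1,L1: taken unless bits 31..14 of R1 are all 0s or all 1s */
static bool bneh18_taken(uint32_t r1) {
    int32_t hi = (int32_t)r1 >> 14;              /* arithmetic shift */
    return hi != 0 && hi != -1;
}

/* xtrhl: compressed pointer in bits 14..0 of r3, prefix from r2 */
static uint32_t xtrhl(uint32_t r2, uint32_t r3) {
    uint32_t f = r3 & 0x7FFFu;
    return f ? ((r2 & 0xFFFF8000u) | f) : 0;     /* nil stays nil */
}

/* xtrhh: compressed pointer in bits 30..16 of r3, prefix from r2 */
static uint32_t xtrhh(uint32_t r2, uint32_t r3) {
    uint32_t f = (r3 >> 16) & 0x7FFFu;
    return f ? ((r2 & 0xFFFF8000u) | f) : 0;
}

/* xtrl: narrow value in bits 14..0 of r2, sign-extended from bit 14 */
static int32_t xtrl(uint32_t r2) {
    return ((int32_t)(r2 << 17)) >> 17;
}

/* xtrh: narrow value in bits 30..16 of r2, sign-extended from bit 30 */
static int32_t xtrh(uint32_t r2) {
    return ((int32_t)(r2 << 1)) >> 17;
}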
Next we give a simple example to illustrate the use of the above instructions. Let us assume that an integer field t->value and a pointer field t->next are compressed together into a single field t->value_next. In Fig. 3a we show how compressibility checks are used prior to storing the newvalue and newnext values into the compressed fields. In Fig. 3b we illustrate the extract-and-expand instructions by extracting the compressed values stored in t->value_next.

; $16 : &t->value_next
; $18 : newvalue
; $19 : newnext
;
; branch if newvalue is not compressible
      bneh18 $18, $L1
; branch if newnext is not compressible
      bneh17 $16, $19, $L1
; store compressed data in t->value_next
      ori    $19, $19, 0x7fff
      swr    $18, 0($16)
      swr    $19, 2($16)
      j      $L2
$L1:  ; allocate extra locations and store pointer
      ; to extra locations in t->value_next;
      ; store uncompressed data in extra locations
      ...
$L2:  ...

(a) Illustration of compressibility checks.

; $16 : &t->value_next
; $17 : uncompressed integer t->value
; $18 : uncompressed pointer t->next
;
; load contents of t->value_next
      lw     $3, 0($16)
; branch if $3 is a pointer to extra locations
      bltz   $3, $L1
; extract and expand t->value
      xtrl   $17, $3
; extract and expand t->next
      xtrhh  $18, $16, $3
      j      $L2
$L1:  ; load values from extra locations
      ...
$L2:  ...

(b) Illustration of extract-and-expand instructions.

Fig. 3. An example.
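For comparison with the assembly in Fig. 3b, the same load, check, and extract sequence can be written at the C level using the models above (again a sketch with our own names, assuming the 32-bit layout and the structures from the earlier sketches):

/* Read both fields back, mirroring Fig. 3b. */
static void get_fields(struct list_node *t, int32_t *v, uint32_t *next) {
    uint32_t w = t->value_next;
    if ((int32_t)w < 0) {                        /* bit 31 set: indirection */
        struct overflow *x =
            (struct overflow *)(uintptr_t)(w & 0x7FFFFFFFu);
        *v = x->value;
        *next = x->next;
    } else {
        *v = xtrl(w);                            /* value from bits 14..0  */
        *next = xtrhh((uint32_t)(uintptr_t)t, w);/* next from bits 30..16  */
    }
}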
4 Compiler Support
Object layout transformations can only be applied to a C program if the user does not access the fields through explicit address arithmetic and does not typecast the objects of the transformed type into objects of another type. Like prior work by Truong et al. [14] on field reorganization and instance interleaving, we assume that the programmer has given us the go ahead to freely transform the data structures when it is appropriate to do so. From this step onwards, the rest of the process is carried out automatically by the compiler. In the remainder of this section we describe key aspects of the compiler support required for effective data compression.

Identifying fields for compression and packing. Our observation is that most pointer fields can be compressed quite effectively using the common-prefix transformation. Integer fields to which the narrow-data transformation can be applied can be identified either from knowledge about the application or using value profiling. The most critical issue is that of pairing compressed fields for packing into a single word. For this purpose we must first categorize the fields as hot fields and cold fields. It is useful to pack two hot fields together if they are typically accessed in tandem, because in that situation a single load can be shared while reading the two values. It is also useful to pack any two cold fields together even if they are not accessed in tandem, because even though they cannot share the same load, they are not accessed frequently. In all other situations it is not as useful to pack data together: even though space savings will be obtained, execution time will be adversely affected. We used basic block frequency counts to identify pairs of fields belonging to the above categories and then applied compression transformations to them (a small sketch of this pairing rule follows below).

ccmalloc vs malloc. We make use of ccmalloc [6], a modified version of malloc, for carrying out storage allocation. This form of storage allocation was developed by Chilimbi et al. [6]; as described earlier, it improves the locality of dynamic data structures by allocating the linked nodes of the data structure as close to each other as possible in the heap. As a consequence, this technique increases the likelihood that the pointer fields in a given node will be compressible. Therefore it makes sense to use ccmalloc in order to exploit the synergy between ccmalloc and data compression.

Register pressure. Another issue that we consider in our implementation is the potential increase in register pressure. The code executed when the pointer fields are found to be uncompressible is substantial, and therefore it can increase register pressure significantly, causing a loss in performance. However, we know that this code is executed very infrequently since very few fields are uncompressible. Therefore, in this piece of code we first free registers by saving their values, and after executing the code the values are restored. In other words, the increase in register pressure does not have an adverse effect on frequently executed code.
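The pairing rule referred to above can be rendered as the following sketch. The HOT/COLD classification (derived from basic block frequency counts) and the in_tandem test are abstracted, and the names are our own simplification, not the compiler's actual interface.

/* Sketch: decide whether two fields should be packed into one word. */
enum temperature { COLD, HOT };

static int should_pack(enum temperature a, enum temperature b, int in_tandem) {
    if (a == HOT && b == HOT)
        return in_tandem;  /* hot pair pays off only if one load is shared */
    if (a == COLD && b == COLD)
        return 1;          /* cold pair: space savings, rare time penalty  */
    return 0;              /* mixed pair: likely to hurt execution time    */
}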
Instruction cache behavior and code size. The additional instructions generated for implementing compression can lead to an increase in code size, which can in turn affect the instruction cache behavior. It is important to note, however, that a large part of the code size increase is due to the handling of the infrequent case in which the data is found not to be compressible. To minimize the impact on code size, we can share the code for handling this infrequent case across all the updates corresponding to a given data field. To minimize the impact on instruction cache performance, we can employ a code layout strategy which places this infrequently executed code elsewhere and creates branches to it and back, so that the instruction cache behavior of more frequently executed code is minimally affected. Our implementation currently does not support these techniques, and therefore we observed code size increases and degraded instruction cache behavior in our experiments.

Code generation. The remaining code generation details for implementing data compression are for the most part quite straightforward. Once the fields have been selected for compression and packing, whenever a use of the value of any of the fields is encountered, the load is followed by an extract-and-expand instruction. If the value of any of the compressed fields is to be updated, the compressibility check is performed before storing the value. When two hot fields that are packed together are to be read/updated, we initially generate separate loads/stores for them; later, in a separate pass, we eliminate the latter of the two loads/stores whenever possible.
5 Performance Evaluation
Experimental setup. We have implemented the techniques described above in order to evaluate their performance. The transformations have been implemented as part of the gcc compiler, and the DCX instructions have been incorporated in the MIPS-like instruction set of the superscalar processor simulated by simplescalar [3]. The evaluation is based upon six benchmarks taken from the Olden test suite [5] (see Fig. 4a), which contains pointer intensive programs that make extensive use of dynamically allocated data structures. In order to study the impact of memory performance we varied the input sizes of the programs and also varied the L2 cache latency. The cache organization of simplescalar is shown in Fig. 4b. There are separate first level instruction and data caches (I-cache and D-cache); the lower level cache is a unified cache for instructions and data. The L1 caches are 16K direct mapped with a 9 cycle miss latency, while the unified L2 cache is 256K 2-way with 100/200/400 cycle miss latencies. Our experiments are for an out-of-order issue superscalar with an issue width of 4 instructions and the Bimod branch predictor.

Impact on storage needs. The transformations applied and their impact on node sizes are shown in Fig. 5a. In the first four benchmarks (treeadd, bisort, tsp, and perimeter), node sizes are reduced by storing pairs of compressed pointers in a single word. In the health benchmark a pair of small values are
Program     Application
treeadd     Recursive sum of values in a B-tree
bisort      Bitonic Sorting
tsp         Traveling salesman problem
perimeter   Perimeters of regions in images
health      Columbian health care simulation
mst         Minimum Spanning tree of a graph

(a) Benchmarks.

Parameter                                Value
Issue Width                              4 issue, out of order
I cache                                  16K direct mapped
I cache miss latency                     9 cycles
L1 data cache                            16K direct mapped
L1 data cache miss latency               9 cycles
L2 unified cache                         256K 2-way
Memory latency (L2 cache miss latency)   Configuration 1/2/3 = 100/200/400 cycles

(b) Machine configurations.

Fig. 4. Experimental setup.
compressed together and stored in a single word. Finally, in the mst benchmark a compressed pointer and a compressed small value are stored together in a single word. The reductions in node sizes range from 25% to 33% for five of the benchmarks; only in the case of tsp is the reduction smaller, at just over 10%. We measured the runtime savings in heap allocated storage for small and large program inputs. The results are given in Fig. 5b. The average savings are nearly 25%, and they range from 10% to 33% across the benchmarks. Even more importantly, these savings represent significant amounts of heap storage, typically in megabytes. For example, the 33% storage savings for treeadd represents 4.2 Mbytes and 17 Mbytes of heap storage savings for the small and large program inputs respectively. It should also be noted that such savings cannot be obtained by the other locality improving techniques described earlier [14, 15, 6].

From the results in Fig. 5b we make another very important observation. The number of extra locations allocated when non-compressible data is encountered is non-zero for all of the benchmarks. In other words, for every data structure to which our compression transformations were applied, some instances of the data encountered at runtime were not compressible; a small number of additional locations was allocated to hold a small number of uncompressible pointers and small values in each case. Therefore the generality of our transformation, which allows handling of partially compressible data structures, is extremely important. If we had restricted the application of our technique to data fields that are always guaranteed to be compressible, we could not have achieved any compression, and therefore no space savings would have resulted.

We also measured the increase in code size caused by our transformations (see Fig. 5c). The increase in code size prior to linking is significant, while after linking the increase is very small since the user code is a small part of the binaries. The reason for the significant increase in user code is that each time a compressed field is updated, our current implementation generates a new copy of the additional code for handling the case where the data being stored may
not be compressible. In practice it is possible to share this code across multiple updates. Once such sharing has been implemented, we expect that the increase in the size of user code will also be quite small.

Program     Transformation Applied    Size Change (bytes)
treeadd     Com.Prefix/Com.Prefix     from 28 to 20
bisort      Com.Prefix/Com.Prefix     from 12 to 8
tsp         Com.Prefix/Com.Prefix     from 36 to 32
perimeter   Com.Prefix/Com.Prefix     from 12 to 8
health      NarrowData/NarrowData     from 16 to 12
mst         Com.Prefix/NarrowData     from 16 to 12

(a) Reduction in node size.

Program     Before Linking   After Linking
treeadd     16.4%            0.04%
bisort      40.0%            0.01%
tsp         4.9%             0.18%
perimeter   21.3%            1.97%
health      33.7%            0.23%
mst         10.7%            0.06%
average     21.1%            0.41%

(c) Code size increase.

            Small Input (bytes)                          Large Input (bytes)
Program     Original    Total (Extra)       Savings      Original    Total (Extra)       Savings
treeadd     12582900    8402040 (13440)     33.2%        50331636    33605684 (51260)    33.2%
bisort      786420      549880 (25600)      30.1%        3145716     2301304 (204160)    26.8%
tsp         5242840     4200352 (6080)      19.9%        20971480    16800224 (23040)    19.9%
perimeter   4564364     3265380 (5120)      28.5%        20332620    14546980 (23680)    28.5%
health      566872      510272 (320)        10.0%        1128240     1015124 (320)       10.0%
mst         3414020     2367812 (320)       30.6%        54550532    37781828 (320)      30.7%
average                                     25.4%                                        24.9%

(b) Reduction in heap storage for small and large inputs.

Fig. 5. Impact on storage needs.
Impact on execution times. Based upon the cycle counts provided by the simplescalar simulator, we studied the changes in execution times resulting from the compression transformations. The impact of L2 latency on execution times was also studied. The results in Fig. 6 are for small inputs. For an L2 cache latency of 100 cycles, the reductions in execution times in comparison to the original programs which use malloc range from 3% to 64%, while on average the reduction in execution time is around 30%. The reductions for higher latencies are similar. We also compared our execution times with versions of the programs that use ccmalloc. Our approach outperforms ccmalloc in five out of the six benchmarks (our version of mst runs slightly slower than the ccmalloc version); on average we outperform ccmalloc by nearly 10%. Our approach outperforms ccmalloc because once the node sizes are reduced, typically a greater number of nodes fit into a single cache line, leading to fewer cache misses. We also pay additional runtime overhead in the form of the extra instructions needed to carry out compression and extraction of compressed values. However, this additional
execution time is more than offset by the time savings resulting from reduced cache misses, thus leading to an overall reduction in execution time. On average, compression reduces the execution times by 10%, 15%, and 20% over ccmalloc for L2 cache latencies of 100, 200, and 400 cycles respectively. Therefore we observe that as the latency of the L2 cache is increased, compression outperforms ccmalloc by a greater margin. In summary, our approach provides large storage savings and significant execution time reductions over ccmalloc.

Fig. 6. Reduction in execution time due to data compression. (The figure plots, for each benchmark and on average, the percentage ratios Comp./Orig.*100 and Comp./ccmalloc*100 for L2 latencies of 100, 200, and 400 cycles.)
We would also like to point out that the use of the special DCX instructions was critical in reducing the overhead of compression and extraction. Without DCX instructions the programs would have run significantly slower. We ran versions of the programs which did not use DCX instructions for an L2 cache latency of 100 cycles. The average reduction in execution times, in comparison to the original programs, dropped from 30% to 12.5%; and instead of an average reduction in execution times of 10% in comparison to the ccmalloc versions, we observed an average increase of 9% in execution times.

Impact on power consumption. We also compared the power consumption of the compression based programs with that of the original programs and the ccmalloc based programs (see Fig. 7). These measurements are based upon the Wattch [1] system, which is built on top of the simplescalar simulator. These results track the execution time results quite closely. The average reduction in power consumption over the original programs is around 30% for the small input. The reductions in power dissipation that compression provides over ccmalloc for the different cache latencies are also given: on average, compression reduces the power dissipation by 5%, 10%, and 15% over ccmalloc for L2 cache latencies of 100, 200, and 400 cycles respectively.
Fig. 7. Impact on power consumption. (The figure plots the same percentage comparisons as Fig. 6, for power.)
Impact on cache performance. Finally, in Fig. 8 we present the impact of compression on cache behavior, including the I-cache, D-cache, and unified L2 cache. As expected, the I-cache performance is degraded due to the increase in code size caused by our current implementation of compression. However, the performance of the D-cache and the unified cache is significantly improved. This improvement in data cache performance is a direct consequence of compression.
Fig. 8. Impact on cache misses. (The figure plots the I-cache, D-cache, and unified cache comparisons Comp./Orig.*100 and Comp./ccmalloc*100 for each benchmark and on average.)
6 Related Work
Recently there has been a lot of interest in exploiting narrow width values to improve program performance [2, 12, 13]. However, our work focuses on pointer intensive applications, for which it is important to also handle pointer data. A great deal of research has been conducted on the development of locality improving transformations for dynamically allocated data structures. These transformations alter object layout and placement to improve cache performance [14, 6, 15]; however, none of them result in space savings. Existing compression transformations [10, 7] rely upon compile time analysis to prove that certain data items do not require a complete word of memory. They are applicable only when the compiler can determine that the data being compressed is fully compressible, and they apply only to narrow width non-pointer data. In contrast, our compression transformations apply to partially compressible data and, in addition to handling narrow width non-pointer data, they also apply to pointer data. Our approach is not only more general but also simpler in one respect: we do not require compile-time analysis to prove that the data is always compressible. Instead, simple compile-time heuristics are sufficient to determine that the data is likely to be compressible. ISA extensions have been developed to efficiently process narrow width data, including Intel's MMX [9] and Motorola's AltiVec [11], and compiler techniques are being developed to exploit such instruction sets [8]. However, the instructions we require are quite different from MMX instructions because we must handle partially compressible data structures and we must also handle pointer data.
7 Conclusions
In conclusion, we have introduced a new class of transformations that apply data compression techniques to compact the sizes of dynamically allocated data structures. These transformations result in large space savings as well as significant reductions in program execution times and power dissipation due to improved memory performance. An attractive property of these transformations is that they are applicable to partially compressible data structures. This is extremely important because, according to our experiments, while the data structures in all of the benchmarks we studied are very highly compressible, they all contain small amounts of uncompressible data. Even for programs with fully compressible data structures our approach has one advantage: the application of compression transformations can be driven by simple value profiling techniques [4], so there is no need for complex compile-time analyses for identifying fully compressible fields in data structures. Our approach is applicable to a more general class of programs than existing compression techniques: we can compress pointers as well as non-pointer data, and we can compress partially compressible data structures. Finally, we have designed the DCX ISA extensions to enable efficient manipulation of compressed data; the same task cannot be carried out using MMX type instructions. Our main contribution is that data compression techniques can now be used to
improve the performance of general purpose programs, and therefore this work takes the utility of compression beyond the realm of multimedia applications.
References

1. D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A Framework for Architecture-Level Power Analysis and Optimizations," 27th International Symposium on Computer Architecture (ISCA), pages 83–94, May 2000.
2. D. Brooks and M. Martonosi, "Dynamically Exploiting Narrow Width Operands to Improve Processor Power and Performance," 5th International Symposium on High-Performance Computer Architecture (HPCA), pages 13–22, Jan. 1999.
3. D. Burger and T.M. Austin, "The Simplescalar Tool Set, Version 2.0," Computer Architecture News, pages 13–25, June 1997.
4. M. Burrows, U. Erlingson, S-T.A. Leung, M.T. Vandevoorde, C.A. Waldspurger, K. Walker, and W.E. Weihl, "Efficient and Flexible Value Sampling," Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 160–167, Cambridge, MA, November 2000.
5. M. Carlisle, "Olden: Parallelizing Programs with Dynamic Data Structures on Distributed-Memory Machines," PhD Thesis, Princeton University, Dept. of Computer Science, June 1996.
6. T.M. Chilimbi, M.D. Hill, and J.R. Larus, "Cache-Conscious Structure Layout," ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 1–12, Atlanta, Georgia, May 1999.
7. J. Davidson and S. Jinturkar, "Memory Access Coalescing: A Technique for Eliminating Redundant Memory Accesses," ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 186–195, 1994.
8. S. Larsen and S. Amarasinghe, "Exploiting Superword Level Parallelism with Multimedia Instruction Sets," ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 145–156, Vancouver, B.C., Canada, June 2000.
9. A. Peleg and U. Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, 16(4):42–50, August 1996.
10. M. Stephenson, J. Babb, and S. Amarasinghe, "Bitwidth Analysis with Application to Silicon Compilation," ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 108–120, Vancouver, B.C., Canada, June 2000.
11. J. Tyler, J. Lent, A. Mather, and H.V. Nguyen, "AltiVec(tm): Bringing Vector Technology to the PowerPC(tm) Processor Family," Phoenix, AZ, February 1999.
12. Y. Zhang, J. Yang, and R. Gupta, "Frequent Value Locality and Value-Centric Data Cache Design," Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 150–159, Cambridge, MA, November 2000.
13. J. Yang, Y. Zhang, and R. Gupta, "Frequent Value Compression in Data Caches," 33rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 258–265, Monterey, CA, December 2000.
14. D.N. Truong, F. Bodin, and A. Seznec, "Improving Cache Behavior of Dynamically Allocated Data Structures," International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 322–329, Paris, France, 1998.
15. B. Calder, C. Krintz, S. John, and T. Austin, "Cache-Conscious Data Placement," 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 139–149, San Jose, California, October 1998.
Evaluating a Demand Driven Technique for Call Graph Construction

Gagan Agrawal 1, Jinqian Li 2, and Qi Su 2

1 Department of Computer and Information Sciences, Ohio State University, Columbus, OH 43210
[email protected]
2 Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716
{li,su}@eecis.udel.edu
Abstract. With the increasing importance of just-in-time or dynamic compilation and the use of program analysis as part of software development environments, there is a need for techniques for demand driven construction of a call graph. We have developed a technique for demand driven call graph construction which handles dynamic calls due to polymorphism in object-oriented languages. Our demand driven technique has the same accuracy as the corresponding exhaustive technique. The reduction in the graph construction time depends upon the ratio of the cardinality of the set of influencing nodes to the total number of nodes in the entire program. This paper presents a detailed experimental evaluation of the benefits of the demand driven technique over the exhaustive one. We consider a number of scenarios, including resolving a single call site, resolving all call sites in a method, resolving all call sites within all methods in a class, and computing reaching definitions of all actual parameters inside a method. We compare the analysis time, the number of methods analyzed, and the number of nodes in the working set for the demand driven and exhaustive analyses. We use SPECJVM programs as benchmarks for our experiments. Our experiments show that for the larger SPECJVM programs, javac, mpegaudio, and jack, demand driven analysis on average takes nearly an order of magnitude less time than exhaustive analysis.
1 Introduction
A call graph is a static representation of dynamic invocation relationships between procedures (or functions or methods) in a program. A node in this directed graph represents a procedure, and an edge (p → q) exists if the procedure p can invoke the procedure q. In program analysis and compiler optimization for object-oriented programs, call graph construction becomes a critical step for at least two reasons.
This research was supported by NSF CAREER award ACI-9733520 and NSF grant CCR-9808522.
First, because the average size of a method is typically quite small, very limited information is available without performing interprocedural analysis. Second, because of the frequent use of virtual functions, the accuracy and efficiency of the call graph construction technique is crucial for the results of interprocedural analysis. Therefore, call graph construction or dynamic call site resolution has been a focus of attention lately in the object-oriented compilation community [3,4,8,9,11,13,14,15,19,20,21,24].

We believe that with the increasing popularity of just-in-time or dynamic compilation and the increasing use of program analysis in software development environments, there is a need for demand driven call graph analysis techniques. In a dynamic or just-in-time compilation environment, aggressive compiler analyses and optimizations are applied to selected portions of the code, and not to other less frequently executed or never executed portions. Therefore, the set of procedures called needs to be computed for a small set of call sites, and not for all the call sites in the entire program. Similarly, when program analysis is applied in a software development environment, demand driven call graph analysis may be preferable to exhaustive analysis. For example, while constructing static program slices [23], the information on the set of procedures called is required only for the call sites included in the slice, and depends upon the slicing criterion used. Similarly, during program analysis for regression testing [16], only a part of the code needs to be analyzed, and therefore demand driven call graph analysis can be significantly quicker than an exhaustive approach.

We have developed a technique for performing demand driven call graph analysis [1,2]. The technique has two major theoretical properties. First, the worst-case complexity of our analysis is the same as that of the well known 0-CFA exhaustive analysis technique [18], except that its input is the cardinality of the set of influencing nodes rather than the total number of nodes in the program representation; thus, the advantage of our demand driven technique depends upon the ratio of the size of the set of influencing nodes to the total number of nodes. Second, we have shown that the type information computed by our technique for all the nodes in the set of influencing nodes is as accurate as that of the 0-CFA exhaustive analysis technique. This paper presents an implementation and detailed experimental evaluation of our demand driven call graph construction technique. The implementation has been carried out using the sable infrastructure developed at McGill University [22].

Initial work on call graph construction focused exclusively on exhaustive analysis, i.e., analysis of a complete program. Many recent efforts have focused on analysis when the entire program may not be available, or cannot be analyzed because of memory constraints [6,17,19]. These efforts focus on obtaining the most precision from the available information. In comparison, our goal is to reduce the cost of analysis when demand-driven analysis can be performed, without compromising the accuracy of the analysis. We are not aware of any previous work on performing and evaluating demand-driven call graph analysis for the purpose of efficiency, even when the full program is available. Our work is also related to previous work on demand driven data flow analysis [10,12]. Their work assumes
that a call graph is already available and does not, therefore, apply to the demand driven call graph construction problem.

The rest of the paper is organized as follows. The demand driven call graph construction technique is reviewed in Section 2. Our experimental design is presented in Section 3 and experimental results are presented in Section 4. We conclude in Section 5.

Fig. 1. Procedure A::P's portion of PSG. (The figure shows the This, x, and y nodes at procedure entry and exit, and at the call sites cs1 and cs2.)
2 Demand Driven Call Graph Construction
In this section, we review our demand driven call graph construction technique. More details of the technique are available from our previous papers [1,2]. We use the interprocedural representation Program Summary Graph (PSG), initially proposed by Callahan [5], for presenting our demand driven call graph analysis technique. Procedure A::P's portion of the PSG is shown in Figure 1. We also construct a relatively inaccurate initial call graph by performing relatively inexpensive Class Hierarchy Analysis (CHA) [7]. In presenting our technique, we use the following definitions.

pred(v): The set of predecessors of the node v in the PSG. This set is initially defined during the construction of the PSG and is not modified as the type information becomes more precise.

proc(v): This relation is only defined if the node v is an entry node or an exit node. It denotes the name of the procedure to which the node belongs.

TYPES(v): The set of types associated with a node v in the PSG at any stage in the analysis. This set is initially constructed using Class Hierarchy Analysis, and is later refined through data-flow propagation.
THIS NODE(v): This is the node corresponding to the this pointer at the procedure entry (if v is an entry node), procedure exit (if v is an exit node), procedure call (if v is a call node), or call return (if v is a return node).

THIS TYPE(v): If the vertex v is a call node or a return node, THIS TYPE(v) returns the types currently associated with the call node for the this pointer at this call site. This relation is not defined if v is an entry or exit node.

PROCS(S): Let S be the set of types associated with a call node for a this pointer. Then PROCS(S) is the set of procedures that can actually be invoked at this call site. This function is computed using Class Hierarchy Analysis (CHA).

We now describe how we compute the set of nodes in the PSG for the entire program that influence the set of procedures invoked at the given call site ci. The PSG for the entire program is never constructed; however, for ease in presenting the definition of the set of influencing nodes, we assume that the PSG components of all procedures in the entire program are connected based upon the initial sound call graph. Let v be the call node for the this pointer at the call site ci. Given the hypothetical complete PSG, the set of influencing nodes (which we denote by S) is the minimal set of nodes such that:

1) v ∈ S,
2) (x ∈ S) ∧ (y ∈ pred(x)) → y ∈ S, and
3) x ∈ S → THIS NODE(x) ∈ S.

Starting from the node v, we include the predecessors of any node already in the set, until we reach internal nodes that do not have any predecessors. For any node included in the set, we also include the corresponding node for the this pointer (denoted by THIS NODE) in the set.
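A direct worklist rendering of this closure is sketched below in C. The graph accessors (num_preds, pred, has_this_node, this_node) are assumed interfaces of a PSG implementation, not functions from the paper, and THIS NODE is treated as absent for internal nodes.

#include <stdbool.h>

#define MAX_NODES 65536            /* assumption: nodes have small int ids */

extern int  num_preds(int v);
extern int  pred(int v, int i);
extern bool has_this_node(int v);  /* false for internal nodes */
extern int  this_node(int v);

static bool in_S[MAX_NODES];       /* the set of influencing nodes */

static void add(int v, int *wl, int *top) {
    if (!in_S[v]) { in_S[v] = true; wl[(*top)++] = v; }
}

/* Close the seed (the call node for the this pointer at call site ci)
   under rules 2 and 3 of the definition above. */
void influencing_nodes(int seed) {
    static int wl[MAX_NODES];
    int top = 0;
    add(seed, wl, &top);
    while (top > 0) {
        int x = wl[--top];
        for (int i = 0; i < num_preds(x); i++)
            add(pred(x, i), wl, &top);          /* rule 2 */
        if (has_this_node(x))
            add(this_node(x), wl, &top);        /* rule 3 */
    }
}

Per the paper, each method's portion of the program representation is constructed only when that method first needs to be analyzed, so accessors like pred() would build PSG components lazily.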
The next step in the algorithm is to perform iterative analysis over the set of nodes in the Partial Program Summary Graph (PPSG) to compute the set of types associated with the given initial node. This problem can be modeled as associating the data-flow set TYPES with each node in the PPSG and refining it iteratively. The initial values of TYPES(v) are computed through the class hierarchy analysis described earlier in this section: if a formal or actual parameter is declared to be a reference to class cname, then the actual runtime type of that parameter can be any of the subclasses (including itself) of cname. The refinement stage can be described by a single equation, which is shown in Figure 2:

TYPES(v) =
  TYPES(v) ∪ ( ∪_{p ∈ pred(v)} TYPES(p) )
      if v is a call or exit node
  TYPES(v) ∪ ( ∪_{(p ∈ pred(v)) ∧ (proc(v) ∈ PROCS(THIS TYPE(p)))} TYPES(p) )
      if v is an entry node
  TYPES(v) ∪ ( ∪_{(p ∈ pred(v)) ∧ (proc(p) ∈ PROCS(THIS TYPE(v)))} TYPES(p) )
      if v is a return node

Fig. 2. Data-flow equation for propagating type information.

Consider a node v in the PPSG. Depending upon the type of v, three cases are possible in performing the update: 1) v is a call or exit node, 2) v is an entry node, and 3) v is a return node. In Case 1, the predecessors of the node v are internal nodes, entry nodes of the same procedure, or return nodes at call sites within the procedure. The important observation is that this set of predecessors does not change as the type information is made more precise. So the set TYPES(v) is updated by taking the union of the sets TYPES(p) over the predecessors p of the node v. We next consider Case 2, i.e., when the node v is an entry node. proc(v) is the procedure to which the node v belongs. The predecessors of such a node are the call nodes at all call sites at which the function proc(v) can possibly be called, as per the initial call graph assumed by performing class hierarchy analysis.
Such a set of possible call sites for proc(v) gets restricted as interprocedural type propagation is performed. Let p be a call node that is a predecessor of v. We want to use the set TYPES(p) in updating TYPES(v) only if the call site corresponding to p invokes proc(v). We determine this by checking the condition proc(v) ∈ PROCS(THIS TYPE(p)). The function THIS TYPE(p) determines the types currently associated with the this pointer at the call site corresponding to p, and the function PROCS determines the set of procedures that can be called at this call site based upon this type information.

Case 3 is very similar to Case 2. If the node v is a return node, the predecessor node p of v is an exit node. We want to use the set TYPES(p) in updating TYPES(v) only if the call site corresponding to v can invoke the function proc(p). We determine this by checking the condition proc(p) ∈ PROCS(THIS TYPE(v)). A sketch of the resulting fixed-point computation is given below.

Theoretical Results: The technique has two major theoretical properties [2]. First, the worst-case complexity of our analysis is the same as that of the well known 0-CFA exhaustive analysis technique [18], except that its input is the cardinality of the set of influencing nodes rather than the total number of nodes in the program representation. Thus, the advantage of our demand driven technique depends upon the ratio of the size of the set of influencing nodes to the total number of nodes. Second, we have shown that the type information computed by our technique for all the nodes in the set of influencing nodes is as accurate as that of the 0-CFA exhaustive analysis technique.
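The fixed point implied by the equation in Fig. 2 can be computed with a straightforward iteration. In the sketch below, TYPES sets are 64-bit bitmasks (an assumption that there are at most 64 classes), invokes(a, b) abstracts the test proc(b) ∈ PROCS(THIS TYPE(a)), and all extern names are assumed interfaces.

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t TypeSet;          /* one bit per class */
enum node_kind { CALL_OR_EXIT, ENTRY_NODE, RETURN_NODE };

extern int      n_nodes;           /* number of nodes in the PPSG */
extern enum node_kind kind_of(int v);
extern int      num_preds(int v), pred(int v, int i);
extern TypeSet  TYPES[];           /* initialized from CHA */
extern bool     invokes(int a, int b);

void propagate(void) {
    bool changed = true;
    while (changed) {              /* iterate to a fixed point */
        changed = false;
        for (int v = 0; v < n_nodes; v++) {
            for (int i = 0; i < num_preds(v); i++) {
                int p = pred(v, i);
                bool use = true;   /* Case 1: plain union */
                if (kind_of(v) == ENTRY_NODE)  use = invokes(p, v); /* Case 2 */
                if (kind_of(v) == RETURN_NODE) use = invokes(v, p); /* Case 3 */
                if (use && (TYPES[p] & ~TYPES[v])) {
                    TYPES[v] |= TYPES[p];      /* union in new types */
                    changed = true;
                }
            }
        }
    }
}

Note that the loop runs only over the PPSG, i.e., the set of influencing nodes; this restriction is what yields the demand driven savings reported later.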
3 Experiment Design
We have implemented our demand driven technique using the sable infrastructure developed at McGill University [22]. In this section, we describe the design of the experiments conducted, including the benchmarks used, the scenarios used for evaluating demand driven call graph construction, and the metrics used for comparison.

Benchmark Programs: We have primarily used programs from the most commonly used benchmark set for Java programs, SPECJVM. The 10 SPECJVM programs are check, compress, jess, raytrace, db, javac, mpegaudio, mtrt,
jack, and checkit. The total number of classes, methods, and PSG nodes for each of these benchmarks is listed in Figure 3. The number of classes ranges from 4 to 180, the number of methods from 6 to 1004, and the number of PSG nodes from 51 to 48147.

Benchmark   No. of classes   No. of methods   No. of PSG nodes
check       20               96               3954
compress    15               35               601
jess        8                41               1126
raytrace    28               130              6518
db          6                34               1452
javac       180              1004             48147
mpegaudio   58               270              6205
mtrt        4                6                51
jack        61               261              14080
checkit     6                8                495

Fig. 3. Description of benchmarks.

Scenarios for Experiments: In Section 2, our technique was presented under the assumption that the call graph edges need to be computed for a single call site. In practice, demand driven analysis may be invoked under more complex scenarios. For example, one may be interested in knowing the reaching definitions for a set of variables in a method. Performing this analysis may require knowing the methods invoked at a set of call sites in the program. Thus, demand driven call graph analysis may be performed to determine the call graph edges at the call sites within this set. Alternatively, there may be interest in fully analyzing a single method or a class, and selectively analyzing code from other methods or classes to have more precise information within the method or class. We have conducted experiments to evaluate demand driven call graph construction under the following scenarios:

– Experiment A: Resolving a single call site in the program. We have only considered the call sites that can potentially invoke multiple methods after Class Hierarchy Analysis (CHA) is applied. This is the simplest case for the demand driven technique, and should require analyzing only a small set of procedures and PSG nodes in the program.

– Experiment B: Computing reaching definitions of all actual parameters at all call sites within a method. Computing interprocedural reaching definitions will typically require knowing the calling relationship at a set of call sites. This scenario depicts a situation in which demand driven call graph construction is invoked while computing certain data-flow information on a demand basis.

– Experiment C: Resolving all call sites within a method. This is more complicated than experiment A above, and represents a more realistic case in which interprocedural optimizations are applied to a portion of the program.
– Experiment D: Resolving all call sites within all methods within a class. This scenario represents analyzing a single class, but performing selective analysis on portions of code from other classes to improve the accuracy of analysis within the class.

Metrics Used: We now describe the metrics used for reporting the benefits of demand driven call graph construction over exhaustive call graph analysis. Performing demand driven analysis will require fewer PSG nodes to be analyzed, fewer procedures to be analyzed, and less time. We report these three factors individually. Specifically, the three metrics used are:

– Time Ratio: This is the ratio of the time required for demand driven analysis to that of exhaustive analysis. This metric evaluates the benefits of using demand driven analysis, but is dependent on our implementation.

– Node Ratio: This is the ratio of the number of nodes in the PPSG to the total number of nodes in the PSG of the entire program. This metric is an implementation-independent indicator of the benefits of the analysis.

– Procedure Ratio: This is the ratio of the number of methods analyzed during demand driven analysis to the total number of methods in the entire program. Since each method's portion of the full program representation used in our analysis is constructed only if that method needs to be analyzed, and is always constructed in its entirety when it does, this metric demonstrates the space-efficiency of demand driven call graph construction.
4 Experimental Results
We now present the results from our experiments. Our experiments were conducted on a Sun 250 MHz Ultra-Sparc processor with 512 MB of main memory. We first present results from exhaustive analysis; then we present results from demand driven analysis for scenarios A, B, C, and D.

Exhaustive Analysis: To provide a comparison against demand driven analysis, we first include the results from exhaustive 0-CFA call graph construction on our set of benchmarks. The results from exhaustive analysis are presented in Figure 4. The time required for Class Hierarchy Analysis (CHA), the time required for the iterative call graph refinement, and the number of call sites that are not monomorphic after applying CHA are shown. Call sites that can potentially invoke multiple methods after CHA has been applied are the ones that can benefit from more aggressive iterative analysis. The time required by the CHA phase in our implementation is dominated by setting up data structures, and turns out to be almost the same for all benchmarks. The time required for the iterative refinement phase varies a lot between benchmarks, and is roughly proportional to the size of the benchmark.

Two important observations from Figure 4 are as follows. First, only 4 of the 10 programs have call sites that are polymorphic after the results of
Benchmark   CHA time (sec.)   Iter. Analysis (sec.)   Polymorphic Call Sites After CHA
check       72.3              27.7                    0
compress    84.5              13.3                    0
jess        96.5              59.4                    0
raytrace    82.1              60.9                    39
db          72.8              12.0                    0
javac       85.6              2613                    577
mpegaudio   73.4              462                     35
mtrt        80.2              3.5                     0
jack        74.1              250.7                   77
checkit     73.6              5.3                     0

Fig. 4. Results from exhaustive analysis.
CHA are known. These 4 programs are raytrace, javac, mpegaudio, and jack. These are also the 4 largest programs in this benchmark set, comprising 28 to 180 classes and 130 to 1004 methods. For the smaller programs, CHA is as accurate as any analysis for constructing the call graph. The second observation is that for 7 of the 10 programs, the total time required for exhaustive call graph construction is dominated by the CHA phase. For the three remaining programs, javac, mpegaudio, and jack, the time required for iterative analysis is 30 times, 6 times, and nearly 4 times the time required for CHA analysis, respectively.

Therefore, for the smaller programs in the benchmark set, CHA analysis is sufficient, and they do not benefit from more aggressive analysis. The dominant cost of analysis is CHA, which remains the same during demand driven call graph construction, so these programs cannot benefit from demand driven analysis. On the other hand, in the larger programs the time required for analysis is dominated by the iterative phase. A large number of call sites are polymorphic after applying CHA, and are therefore likely to benefit from iterative analysis. Since the iterative analysis is applied to a much smaller number of nodes in the demand driven technique, these programs are likely to benefit from the proposed demand driven analysis. This is analyzed in detail in the remainder of this section.

Experiment A: In the first set of experiments, we perform demand driven analysis to resolve a single call site in the program. We only consider call sites that are known to potentially invoke multiple procedures after CHA has been applied. As described in the previous subsection, only raytrace, javac, mpegaudio, and jack contain such polymorphic call sites; therefore, results are presented only for these programs. The averages for time ratio, node ratio, and procedure ratio for these 4 programs are shown in Figure 5. The analysis time compared in this table is the time
Benchmark   No. of Cases   Analysis Time Avg. (sec.) / Ratio   PPSG Nodes Avg. No. / Ratio   Procedures Avg. No. / Ratio
raytrace    39             3.78 / 6.2%                         96.6 / 1.5%                   13.7 / 10.5%
javac       577            341.2 / 13.1%                       9831 / 20.4%                  747 / 74.5%
mpegaudio   35             15.6 / 3.3%                         186.3 / 3.0%                  31.9 / 11.8%
jack        77             11.8 / 4.7%                         422.3 / 2.9%                  46.1 / 17.6%

Fig. 5. Results from experiment A.
for iterative analysis only; for both the demand driven and exhaustive versions, additional time is spent in performing CHA. The average node ratio during demand driven analysis is extremely low for raytrace, mpegaudio, and jack, ranging between 1.5% and 3.0%. This results in an average iterative analysis time ratio of less than 7%. Even the number of procedures that need to be analyzed is less than 20% for these three programs. The results for javac are significantly different, but still demonstrate gains from the use of demand driven analysis. The average node ratio is 20.4%, resulting in an average time ratio of 13.1%. However, the average procedure ratio is nearly 75%, which means that for most of the cases a very large fraction of the procedures needs to be involved in demand driven analysis; the use of demand driven analysis does not result in significant space savings for javac.

After including the time for CHA, the average time ratios are 60%, 16%, 17%, and 26% for raytrace, javac, mpegaudio, and jack, respectively. The gains from demand driven analysis for raytrace are limited because the time required for CHA exceeds the exhaustive iterative analysis time. javac, which had the highest ratio before CHA time was included, has the lowest ratio after including CHA because the time required for exhaustive iterative analysis is more than 30 times the time required for CHA. Demand driven analysis gives clear benefits in the case of javac, mpegaudio, and jack, because the time required for the iterative phase dominates the time required for CHA.

To further study the results from these three benchmarks, we present a series of cumulative frequency graphs. For experiment A, the cumulative frequency graphs for javac, mpegaudio, and jack are presented in Figures 6, 7, and 9, respectively. A point (x, y) in such a graph means that the fraction x of the cases in the experiments had a ratio of less than or equal to y. The results from javac follow an interesting trend. 56 of the 577 cases require analysis of 120 or fewer procedures, or nearly 12% of all procedures. The same set of cases requires analyzing 257 or fewer nodes, or less than 1% of all nodes. The time taken for these cases is also less than 2% of the time for exhaustive analysis. However, the ratios are very different for the remaining cases. The next 413 cases require analysis of the same set of 837 procedures, or 83% of all procedures. The remaining cases require between 838 and 876 procedures to be
analyzed. The analysis time is between 15% and 20% of the exhaustive analysis time, and the number of nodes involved for these cases is nearly 25% of the total number of nodes. The results from mpegaudio are as follows. 11 of the 35 cases require analysis of between 73 and 98 procedures, or between 27% and 36% of all procedures. The same 11 cases require analysis of between 8% and 10% of the nodes, and between 2% and 4% of the time. The other 24 cases require analysis of less than 12% of all procedures, and less than 1.5% of the nodes and time. For jack, 61 of the 77 cases require analysis of 59 or 57 procedures, or nearly 20% of all procedures. The same set of cases requires between 4% and 6% of the time, and between 2% and 4% of all nodes. The other 16 cases involve analyzing less than 5% of all procedures, less than 1% of the time, and less than 0.5% of all nodes.
Fig. 6. Experiment A: Cumulative frequency of time, node, and procedure ratio for javac.

Fig. 7. Experiment A: Cumulative frequency of time, node, and procedure ratio for mpegaudio.
Experiment B: In the second set of experiments, we evaluated the performance of demand driven call graph construction when it is initiated from demand driven data flow analysis. The particular data flow problem we consider is the computation of reaching definitions for all actual parameters in a procedure. We report results from this experiment only for raytrace, mpegaudio, and jack: the 6 smaller programs in the SPECJVM benchmark set do not contain any polymorphic call sites, and even after many attempts we could not complete this experiment for javac, the largest program in the benchmark set. We believe this was because of the very large memory requirements when the reaching definition and call graph construction analyses are combined.

The average time, node, and procedure ratios for the three benchmarks are presented in Figure 8. As compared to experiment A, we report results from a significantly larger number of cases, because this analysis was performed on all procedures. At the same time, many cases in experiment B may require resolution of several polymorphic call sites. The three ratios for mpegaudio are lower for experiment B, as compared to the ones obtained from
Benchmark   No. of Cases   Analysis Time Avg. (sec.) / Ratio   PPSG Nodes Avg. No. / Ratio   Procedures Avg. No. / Ratio
raytrace    129            4.36 / 7.2%                         354.3 / 5.4%                  28.7 / 22.0%
mpegaudio   270            5.48 / 1.2%                         133.5 / 2.2%                  26.8 / 9.9%
jack        261            15.44 / 6.2%                        524.9 / 3.7%                  94.8 / 36.6%

Fig. 8. Results from experiment B.
Fig. 9. Experiment A: Cumulative frequency of time, node, and procedure ratio for jack.

Fig. 10. Experiment B: Cumulative frequency of time, node, and procedure ratio for mpegaudio.
experiment A. For raytrace and jack, the reverse is true; the three ratios are higher for experiment B. The ratios of iterative analysis time are 7.2%, 1.2%, and 6.2% for raytrace, mpegaudio, and jack, respectively. After including the time for CHA, the ratios of the time required are 60%, 14%, and 27%, respectively.

We studied the results in more detail for mpegaudio and jack. The cumulative frequency plots for these two benchmarks are presented in Figures 10 and 11, respectively. The results from mpegaudio are as follows. 192 of the 270 cases require analysis of 33 or fewer procedures, or less than 12% of all procedures. The same set of cases requires analysis of less than 2% of all nodes, and takes less than 1% of the time for exhaustive analysis. For the remaining cases, the number of procedures to be analyzed is distributed fairly uniformly between 66 and 118. For jack, the trends are very different. 126 of the 261 cases require analysis of 162 or 161 procedures, or nearly 62% of all procedures. The same set of cases requires analysis of nearly 800 nodes, or 6% of all nodes. The time required for this set of cases is nearly 9% of the time for exhaustive iterative analysis. The portions of the program that need to be analyzed for this set of cases (48% of all cases) are almost the same. This has the following implication: if demand driven analysis is performed for one of these cases and then needs to be performed for another case in the same set, very limited additional effort will be required.
Fig. 11. Experiment B: Cumulative frequency of time, node, and procedure ratio for jack

Fig. 12. Experiment C: Cumulative frequency of time, node, and procedure ratio for javac

Benchmark    No. of Cases   Analysis Time        PPSG Nodes         Procedures
                            Avg. (sec.)  Ratio   Avg. No.   Ratio   Avg. No.  Ratio
raytrace     130            4.51         7.4%    358.9      5.5%    29.1      22.4%
javac        1004           271.1        10.3%   7634.5     15.8%   587       58.5%
mpegaudio    270            5.37         1.2%    133.5      2.1%    26.8      9.9%
jack         261            14.9         5.9%    524.9      3.7%    94.8      36.3%
Fig. 13. Results from experiment C
Experiment C: Our next set of experiments evaluated the performance of demand driven call graph construction when all call sites in a procedure had to be resolved. We present data only from raytrace, javac, mpegaudio, and jack, because they contain polymorphic call sites. For these programs, we include results from analysis of all methods, even if they do not contain any polymorphic call site. The averages of time, node, and procedure ratios are presented in Figure 13. The averages are very close to the results for experiment B. We believe that this is because all call sites in a method had to be resolved for experiment C, and all sites that can potentially invoke a method had to be resolved for experiment B. The three ratios for javac are lower for experiment C, as compared to experiment A. This is because the averages are taken over a much larger number of cases in experiment C. Many of the procedures do not require analysis of any polymorphic call site, and contribute to a lower overall average. The cumulative frequency plots for javac, mpegaudio, and jack are presented in Figures 12, 14, and 15, respectively.
Results from javac for experiment C are similar to the results from experiment A, with one important difference. A larger fraction of cases can be analyzed with a small fraction of procedures and nodes. 316 of 1004 cases require between 1 and 125 procedures, or up to 12% of all procedures. The remaining 688 cases
require between 837 and 907 procedures, nearly 25% of all nodes, and nearly 15% of exhaustive analysis time. Results from mpegaudio for experiment C are very similar to the results from experiment B. 192 of 270 cases (the same number as in experiment B) require analysis of at most 33 procedures, while the remaining cases need analysis of between 66 and 118 procedures. The same trend (closeness between results from experiments B and C) continues for jack.
Fig. 14. Experiment C: Cumulative frequency of time, node, and procedure ratio for mpegaudio
Fig. 15. Experiment C: Cumulative frequency of time, node, and procedure ratio for jack
Experiment D: Our final set of experiments evaluated demand driven analysis when all call sites in all procedures of a class are to be resolved. Figure 16 presents the average time ratio, node ratio, and procedure ratio for raytrace, javac, mpegaudio, and jack. Even though each invocation of demand driven analysis may involve resolving several call sites, the ratios are quite small. For raytrace, mpegaudio, and jack, the averages of time ratios and node ratios are still less than 10%. The averages for javac are a bit higher, consistent with the previous experiments. The average time ratio and node ratio are 13.1% and 20.6%, respectively. Space savings are not significant with javac, but quite impressive for the other three benchmarks. After including the time required for CHA, the average time ratio is 61% for raytrace, 16% for javac, 16% for mpegaudio, and 25% for jack.
In comparison with the results from experiment C, the averages of ratios from experiment D are all higher for raytrace, javac, and mpegaudio, as one would normally expect. The surprising results are from jack, where all three ratios are lower in experiment D. The explanation for this is as follows. The results from experiment D are averaged over a smaller number of cases, specifically, 61 instead of 261 for jack. It turns out that the procedures that require the most time, number of nodes, and number of procedures to be analyzed belong to a small set of classes. Therefore, they contribute much more significantly to the
Benchmark    No. of Cases   Analysis Time        PPSG Nodes         Procedures
                            Avg. (sec.)  Ratio   Avg. No.   Ratio   Avg. No.  Ratio
raytrace     28             5.32         8.7%    598.3      9.2%    41.5      31.9%
javac        180            343.6        13.1%   9940       20.6%   741.3     73.8%
mpegaudio    58             14.1         3.1%    280.5      4.5%    47.6      17.6%
jack         61             7.49         4.7%    291.3      2.1%    27.6      10.5%
Fig. 16. Results from experiment D
Fig. 17. Experiment D: Cumulative frequency of time, node, and procedure ratio for javac
Fig. 18. Experiment D: Cumulative frequency of time, node, and procedure ratio for mpegaudio
average ratios in the results from experiment C than in those from experiment D. Details of the results from javac, mpegaudio, and jack are presented in Figures 17, 18, and 19, respectively. Again, the results from javac are very different from the results on the other two benchmarks. In javac, 20 of the 180 classes can be resolved by analyzing a small fraction of procedures. Specifically, these cases require analysis of between 1 and 63 procedures, i.e., less than 7% of all procedures in the program. However, the other 160 cases require analysis of between 837 and 963 procedures in the program. Each of the cases from this set requires analyzing nearly 25% of all the nodes in the program, and between 15% and 20% of the time for exhaustive analysis. However, the sets of influencing nodes that need to be analyzed for these cases are almost identical. Our theoretical result, therefore, implies that after one of these cases has been analyzed, the time required for the other cases will be very small. For mpegaudio, the number of procedures that need to be analyzed for the 58 cases ranges from 1 to 139, or from less than 1% to nearly 50%. The distribution is fairly uniform. The time required for demand driven analysis for these cases also has a fairly uniform distribution, between 0.1 and 22.5 seconds, or between 0.02% and 5% of the time required for exhaustive analysis. Similarly, the
number of nodes ranges from 2 to 880, or from 0.03% to 13%. The results from jack are similar.

Fig. 19. Experiment D: Cumulative frequency of time, node, and procedure ratio for jack
5 Conclusions
We have presented an evaluation of an algorithm for resolving call sites in an object oriented program in a demand driven fashion. The summary of our results using the SPECJVM benchmarks is as follows:
– The time required for Class Hierarchy Analysis (CHA), which is a prerequisite for both exhaustive and demand driven iterative analysis, dominates the exhaustive call graph construction time for 7 of the 10 SPECJVM programs. However, CHA itself is sufficient for constructing an accurate call graph for 6 of these 7 programs. The time required for exhaustive iterative analysis clearly dominates CHA time for the three largest SPECJVM programs, javac, mpegaudio, and jack.
– For resolving a single call site, demand driven iterative analysis averages at nearly 10% of the time required for exhaustive iterative analysis. The number of nodes that need to be analyzed averages at nearly 3% for mpegaudio and jack, but around 20% for javac. The number of procedures that need to be analyzed is less than 20% for mpegaudio and jack, but nearly 75% for javac.
– The averages for the number of nodes and procedures analyzed and the time taken surprisingly stay low when all call sites within a class or a method are analyzed instead of a single call site. This is because the program portions that need to be analyzed for resolving different call sites within a method or a class are highly correlated.
A Graph–Free Approach to Data–Flow Analysis

Markus Mohnen
Lehrstuhl für Informatik II, RWTH Aachen, Germany
[email protected]
Abstract. For decades, data–flow analysis (DFA) has been done using an iterative algorithm based on graph representations of programs. For a given data–flow problem, this algorithm computes the maximum fixed point (MFP) solution. The edge structure of the graph represents possible control flows in the program. In this paper, we present a new, graph–free algorithm for computing the MFP solution. An experimental implementation of the algorithm was applied to a large set of samples. The experiments clearly show that the memory usage of our algorithm is much better: Our algorithm always reduces the amount of memory, in the best case to less than a tenth of what the classical algorithm uses. In the average case, it needs about a third of the memory of the classical algorithm. In addition, the experiments showed that the runtimes are almost the same: The average speedup of the classical algorithm is only marginally greater than one.
1 Introduction
Optimising compilers perform various static program analyses to obtain the information needed to apply optimisations. In the context of imperative languages, the technique commonly used is data–flow analysis (DFA). It provides information about properties of the states that may occur at a given program point during execution. Here, the programs considered are intermediate code, e.g. three address code, register code, or Java Virtual Machine (JVM) code [LY97]. For decades, the de facto classical algorithm for DFA has been an iterative algorithm [MJ81, ASU86, Muc97] which uses a graph as its essential data structure. The graph is extracted from the program, making explicit the possible control flows in the program as the edge structure of the graph. Typically, the nodes of the graph are basic blocks (BB), i.e. maximal sequences of straight–line code (but see also [KKS98] for comments on the adequacy of this choice). A distinct root node of the graph corresponds to the entry point of the program. For a given graph and a given initial annotation of the root node, the algorithm computes an annotation for each of the nodes. Each annotation captures the information about the state of the execution at the corresponding program point. The exact relation between annotations and states depends on the data–flow problem. However, independently of the exact relation, the annotations computed by the algorithm are guaranteed to be the greatest solution of the consistency equations imposed by the data–flow problem. This result is known as the maximal fixed point (MFP) solution.
In the context of BB graphs, there is a need for an additional post–processing of the annotations. Since each BB represents a sequence of instructions, the annotation for a single BB must be propagated to the instruction level. As a result of this post–processing, each program instruction is annotated. The contribution of this paper is an alternative algorithm for computing the MFP solution. In contrast to the classical algorithm, our approach is graph–free: Besides a working set, it does not need any additional data structures (of course, the graph structure is always there implicitly in the program). The key idea is to give the program a more active role: While the classical approach transforms the program to a passive data object on which the solver operates, our point of view is that the program itself executes on the annotation. An obvious advantage of this approach is the reduced memory usage. In addition, it is handy if there is already machinery for the execution of programs available. Consequently, our execution–based approach is advantageous in settings where optimisations are done immediately before execution of the code. Here it saves effort to implement the analyses and it saves valuable memory for the execution. The most prominent example of such a setting is the Java Virtual Machine (JVM) [LY97]. In fact, the JVM specification requires that each class file is verified at linking time by a data–flow analyser. The purpose of this verification is to ensure that the code is well–typed and that no operand stack overflows or underflows occur at runtime. In addition, certain optimisations cannot be done by the Java compiler producing JVM code. For instance, optimisations w.r.t. memory allocation like compile–time garbage collection (CTGC) can only be done in the JVM since the JVM code does not provide facilities to influence the memory allocation. CTGC was originally proposed in the context of functional languages [Deu97, Moh97] and then adopted for Java [Bla98, Bla99].
To validate the benefits of our approach, we studied the performance of the new algorithm in competition with the classical one, both in terms of memory usage and runtime. Therefore, we applied both to a large set of samples. The experiments clearly show that the memory usage of our algorithm is much better: Our algorithm always reduces the amount of memory, in the best case to less than a tenth of the memory of the classical algorithm. In the average case, the reduction is to about a third of the memory usage of the classical algorithm. Moreover, the runtimes are comparable in the average case: Using the classical algorithm does not give a substantial speedup.
Structure of this article. We start by defining some basic notions. In Section 3 the classical, iterative algorithm for computing the MFP solution is discussed briefly. Our main contribution starts with Section 4, where we present the new execution algorithm, discuss its relation to the classical algorithm, and prove termination and correctness. Experimental results presented in Section 5 give an estimation of the benefits of our method. Finally, Section 6 concludes the paper.
2 Notations
In this section, we briefly introduce the notations that we use in the rest of the paper. Although we focus on abstract interpretation based DFA, our results are applicable to other DFAs as well. The programs we consider are three–address code programs, i.e. non–empty sequences of instructions I ∈ Instr. Each instruction I is either a jump, which can be conditional (if ψ goto n) or unconditional (goto n), or an assignment (x:=y◦z). In assignments, x must be a variable, and y and z can be variables or constants. Since we consider intraprocedural DFA only, we do not need instructions for procedure calls or exits. The major point of this setting is to distinguish between instructions which cause the control flow to branch and those which keep the control flow linear. Hence, the exact structure is not important. Any other intermediate code, like the JVM code, is suitable as well.
To model program properties, we use lattices L = ⟨A, ⊓, ⊔⟩ where A is a set, and ⊓ and ⊔ are binary meet and join operations on A. Furthermore, ⊥ and ⊤ are the least and greatest elements of the lattice. Often, finite lattices are used, but in general it suffices to consider lattices which have only finite chains. The point of view of DFA based on abstract interpretation [CC77, AH87] is to replace the standard semantics of programs by an abstract semantics describing how the instructions operate on the abstract values A. Formally, we assume a monotone semantic functional ⟦·⟧ : Instr → (A → A) which assigns a function on A to each instruction. A data–flow problem is a quadruple (P, L, ⟦·⟧, a0) where P = I0 . . . In ∈ Instr+ is a program, L is a lattice, ⟦·⟧ is an abstract semantics, and a0 ∈ A is an initial value for the entry of P.
To define the MFP solution of a data–flow problem, we first introduce the notion of predecessors. For a given program P = I0 . . . In ∈ Instr+, we define the function predP : {0, . . . , n} → P({0, . . . , n}) in the following way: j ∈ predP(i) iff either Ij ∈ {goto i, if ψ goto i}, or i = j + 1 and Ij ≠ goto t for all t. Intuitively, the predecessors of an instruction are all instructions which may be executed immediately before it. The MFP solution is a vector of values s0, . . . , sn ∈ A. Each entry si is the abstract value valid immediately before the instruction Ii. It is defined as the greatest solution of the equation system si = ⊓j∈predP(i) ⟦Ij⟧(sj). The well–known fixed point theorem by Tarski guarantees the existence of the MFP solution in this setting.
Example 1 (Constant Folding Propagation). We now introduce an example, which we use as a running example in the rest of the paper. Constant folding and propagation aims at finding as many constants as possible at compile time, and replacing the computations with the constant values. In the setting described above, we associate with each variable and each program point the information if the variable is always constant at this point. For simplicity, we assume that the program only uses the arithmetic operations on integers. We define a set C := Z ∪ {⊤, ⊥} and a relation c1 ≤ c2 iff (a) c1 = c2, (b) c1 = ⊥, or
(c) c2 = ⊤. Intuitively, values can be interpreted in the following way: An integer means “constant value”, ⊤ means “not constant due to missing information”, and ⊥ means “not constant due to conflict”. The relation ≤ induces meet and join operations. Hence, ⟨C, ⊓, ⊔⟩ is a (non–finite) lattice with only finite chains. Fig. 1 shows the corresponding Hasse diagram.
The abstract lattice is defined in terms of this lattice. Formally, let X be the set of variables of a program P. By definition, X is finite. We define the set of abstract values as C̄ := X → C, the set of all functions mapping a variable to a value in C. Since X is finite and C has only finite chains, C̄ has only finite chains as well. We obtain meet and join operations ⊓C̄, ⊔C̄ in the canonical way by argument–wise use of the corresponding operation on C. Hence, our lattice for this abstract interpretation is ⟨C̄, ⊓C̄, ⊔C̄⟩.
The abstract semantics ⟦·⟧C : Instr → (C̄ → C̄) is defined in the following way: For jumps, we define ⟦goto l⟧C and ⟦if ψ goto l⟧C to be the identity, since jumps do not change any variable. For assignments, we define ⟦x:=y◦z⟧C := c ↦ c′, where c′ = c[x/a], i.e. c′ is the same function as c except at argument x. The new value is defined as
    c′(x) = a := ay ◦ az   if (y = ay ∈ Z or c(y) = ay ∈ Z) and (z = az ∈ Z or c(z) = az ∈ Z)
                 ⊥         otherwise
Intuitively, the value of the variable on the left–hand side is constant iff all operands are either constants in the code or known to be constants during execution. For a data–flow problem, the initial value will be a0 = ⊥: At the entry, no variable can be constant. Fig. 2 shows an example for a program, the associated abstractions, the equation system, and the MFP solution. This example also demonstrates why it is necessary to use the infinite lattice C: The solution contains the constant ‘5’ which is not found in the program.
Our presentation of these notions differs slightly from the presentation found in text books. Typically, data–flow problems are already formulated using an explicit graph structure. However, we want to point out that this is not a necessity. Furthermore, it allows us to formulate and prove the correctness of our algorithm without reference to the classical one.
Fig. 1. Hasse diagram of ⟨C, ⊓, ⊔⟩ [⊤ at the top, the integers · · · −2 −1 0 1 2 · · · in the middle layer, ⊥ at the bottom]
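To make the domain concrete, here is a minimal sketch of C = Z ∪ {⊤, ⊥} with its meet operation. The encoding and class names are illustrative assumptions, not the implementation described later in Section 5.

// Sketch of the constant-propagation domain C = Z ∪ {top, bottom} and its meet.
// Encoding and names are illustrative assumptions, not the paper's code.
final class CVal {
    static final CVal TOP = new CVal(0, "top");       // not constant: missing info
    static final CVal BOTTOM = new CVal(0, "bottom"); // not constant: conflict
    final int value;          // meaningful only for integer elements
    private final String tag; // "int", "top", or "bottom"

    private CVal(int value, String tag) { this.value = value; this.tag = tag; }
    static CVal ofInt(int v) { return new CVal(v, "int"); }
    boolean isInt() { return tag.equals("int"); }

    // Meet: x meet TOP = x, x meet BOTTOM = BOTTOM,
    // two integers meet to BOTTOM unless they are equal.
    CVal meet(CVal other) {
        if (this == TOP) return other;
        if (other == TOP) return this;
        if (this == BOTTOM || other == BOTTOM) return BOTTOM;
        return (this.value == other.value) ? this : BOTTOM;
    }
}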
Program                  Abstraction (see footnote a)
I0 = x := 1              x/1
I1 = y := 2              y/2
I2 = z := 3              z/3
I3 = goto 8              (identity)
I4 = r := y + z          r/(c(y)+c(z) if c(y), c(z) ∈ Z; ⊥ otherwise)
I5 = if x ≤ z goto 7     (identity)
I6 = r := z + y          r/(c(z)+c(y) if c(y), c(z) ∈ Z; ⊥ otherwise)
I7 = x := x + 1          x/(c(x)+1 if c(x) ∈ Z; ⊥ otherwise)
I8 = if x < 10 goto 4    (identity)

Equation                            Solution
s0 = a0                             x/⊥ y/⊥ z/⊥ r/⊥
s1 = ⟦I0⟧C(s0)                      x/1 y/⊥ z/⊥ r/⊥
s2 = ⟦I1⟧C(s1)                      x/1 y/2 z/⊥ r/⊥
s3 = ⟦I2⟧C(s2)                      x/1 y/2 z/3 r/⊥
s4 = ⟦I8⟧C(s8)                      x/⊥ y/2 z/3 r/⊥
s5 = ⟦I4⟧C(s4)                      x/⊥ y/2 z/3 r/5
s6 = ⟦I5⟧C(s5)                      x/⊥ y/2 z/3 r/5
s7 = ⟦I5⟧C(s5) ⊓ ⟦I6⟧C(s6)          x/⊥ y/2 z/3 r/5
s8 = ⟦I3⟧C(s3) ⊓ ⟦I7⟧C(s7)          x/⊥ y/2 z/3 r/5

a: For each abstraction only the modification x/v, as abbreviation for c ↦ c[x/v], is given.

Fig. 2. Example for data–flow problem
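The abstraction column of Fig. 2 is exactly the two-case rule for ⟦x:=y◦z⟧C instantiated with ◦ = +. A sketch of this transfer function, reusing the hypothetical CVal encoding from the previous sketch; the operand parsing and the Map-based state are assumptions of this sketch.

import java.util.HashMap;
import java.util.Map;

// Sketch of the transfer function for x := y + z on an abstract state c : X -> C,
// reusing the hypothetical CVal encoding above. Operands may be integer
// literals or variables looked up in the state.
final class ConstFoldTransfer {
    static CVal eval(String operand, Map<String, CVal> c) {
        try { return CVal.ofInt(Integer.parseInt(operand)); } // constant in the code
        catch (NumberFormatException e) { return c.getOrDefault(operand, CVal.BOTTOM); }
    }

    // Returns the updated state c[x/a] for the assignment x := y + z.
    static Map<String, CVal> assignAdd(String x, String y, String z, Map<String, CVal> c) {
        CVal ay = eval(y, c), az = eval(z, c);
        // a_y + a_z if both operands are known constants, bottom otherwise
        CVal a = (ay.isInt() && az.isInt()) ? CVal.ofInt(ay.value + az.value) : CVal.BOTTOM;
        Map<String, CVal> result = new HashMap<>(c);
        result.put(x, a);
        return result;
    }
}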
The approach described so far can be generalised in two dimensions: Firstly, changing ⊓ to ⊔ results in existential data–flow problems, in contrast to universal data–flow problems: The intuition is that a property holds at a point if there is a single path starting at the point such that the property holds on this path. For existential data–flow problems, the least fixed point is computed instead of the greatest fixed point. Secondly, we can change predecessors predP to successors succP : {0, . . . , n} → P({0, . . . , n}), defined as i ∈ succP(j) ⇐⇒ j ∈ predP(i). The resulting class of data–flow problems is called backward problems (in contrast to forward problems), since the flow of information is opposite to the normal execution flow. Here, the abstract values are valid immediately after the corresponding instruction. Altogether, the resulting taxonomy has four cases. However, the algorithms for all the cases have the same general structure. Therefore, we will consider only the forward and universal setting.
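Since succP is just the inverse of predP, both can be computed in one pass over the instruction sequence. A sketch under an assumed Instr encoding; the kinds "goto", "if", and "assign" mirror the instruction forms of Section 2.

import java.util.*;

// Sketch: predecessors/successors of a three-address program, following the
// definitions of pred_P and succ_P. The Instr encoding is an illustrative assumption.
final class FlowInfo {
    record Instr(String kind, int target) {} // kind: "goto", "if", or "assign"

    static List<Set<Integer>> pred(Instr[] p) {
        List<Set<Integer>> pred = new ArrayList<>();
        for (int i = 0; i < p.length; i++) pred.add(new HashSet<>());
        for (int j = 0; j < p.length; j++) {
            // jumps (conditional or unconditional) reach their target
            if (p[j].kind().equals("goto") || p[j].kind().equals("if"))
                if (p[j].target() < p.length) pred.get(p[j].target()).add(j);
            // fall-through, unless the instruction is an unconditional goto
            if (!p[j].kind().equals("goto") && j + 1 < p.length)
                pred.get(j + 1).add(j);
        }
        return pred;
    }

    // i ∈ succ_P(j) iff j ∈ pred_P(i)
    static List<Set<Integer>> succ(Instr[] p) {
        List<Set<Integer>> pred = pred(p), succ = new ArrayList<>();
        for (int i = 0; i < p.length; i++) succ.add(new HashSet<>());
        for (int i = 0; i < p.length; i++)
            for (int j : pred.get(i)) succ.get(j).add(i);
        return succ;
    }
}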
3 Classical Iterative Basic-Block Based Algorithm
This section reviews the classical, graph–based approach to DFA. To make the data–flow of a program explicit, we define two types of flow graphs: single instruction (SI) graphs and basic block (BB) graphs. For a program P = I0 . . . In, we define the SI graph SIG(P) := ({I0, . . . , In}, {(Ij, Ii) | j ∈ predP(i)}, I0) with a node for each instruction, an edge from node Ij to node Ii iff j is a predecessor of i, and root node I0. Intuitively, the BB graph results from the SI graph by merging maximal sequences of straight–line code. Formally, we define the set of basic blocks as the unique partition BB(P) = {B0, . . . , Bm} of P such that (a) each Bj = Ij1 . . . Ijn consists of consecutive instructions (jk+1 = jk + 1), (b) each block is maximal, i.e. a block starts exactly where the straight–line flow from the textually preceding instruction is broken, (c) every instruction other than the first of a block has exactly one predecessor (|predP(jk)| = 1 for j1 < jk ≤ jn), and (d) the blocks cover P, with B0 starting at I0 and Bm ending at In. The BB graph is defined as BBG(P) := (BB(P), {(Bj, Bi) | jn ∈ predP(i1)}, B0).
Fig. 3. Examples for SI graph and BB graph: (a) SI graph over I0–I8; (b) BB graph with B0 = I0 I1 I2 I3, B1 = I4 I5, B2 = I6, B3 = I7, B4 = I8
Example 2 (Constant Folding Propagation, Cont’d). In Fig. 3 we see the SI graph and the BB graph for the example program from the last section. Obviously, for a given flow graph G = (N, E, r), the usual notions of predecessors predG : N → P(N) and successors succG : N → P(N), defined as n ∈ predG(n′), n′ ∈ succG(n) :⇐⇒ (n, n′) ∈ E, coincide with the corresponding notions for programs. For a given data–flow problem (P, L, ⟦·⟧, a0), an additional pre–processing step must be performed to extend the abstract semantics to basic blocks: We define ⟦·⟧ : Instr+ → (A → A) as ⟦I0 . . . In⟧ := ⟦In⟧ ◦ · · · ◦ ⟦I0⟧.
The classical iterative algorithm for computing the MFP solution of a data–flow problem is shown in Fig. 4. In addition to the BB graph G it uses a working set W and an array a, which associates an abstract value with each node. The working set keeps all nodes which must be visited again. In each iteration a node is selected from the working set. At this level, we assume no specific strategy for the working set and consider this choice to be non–deterministic. By visiting all predecessors of this node, a new approximation is computed. If this approximation differs from the last approximation, the new one is used. In addition, all successors of the node are put in the working set. After termination of the main loop, the post–processing is done, which propagates the solution from the basic block level to the instruction level.
Example 3 (Constant Folding Propagation, Cont’d). For the example from the last section, Table 1 shows a trace of the execution of the algorithm. Each line shows the state of working set W, the selected node B, and the array a[.] at the end of the main loop.
Input: Data–flow problem (P, L, ⟦·⟧, a0) where P = I0 . . . In, L = ⟨A, ⊓, ⊔⟩
Output: MFP solution s0, . . . , sn ∈ A

G = (BB(P), E, B0) := BBG(P)
a[B0] := a0
for each B ∈ BB(P) − B0 do a[B] := ⊤
W := BB(P)
while W ≠ ∅ do
    choose B ∈ W
    W := W − B
    new := a[B]
    for each B′ ∈ predG(B) do new := new ⊓ ⟦B′⟧(a[B′])
    if new ≠ a[B] then
        a[B] := new
        for each B′ ∈ succG(B) do W := W + B′
    end
end
for each B ∈ BB(P) do
    with B = Ik . . . Il do
        sk := a[B]
        for i := k + 1 to l do si := ⟦Ii−1⟧(si−1)
    end
end
Fig. 4. Classical iterative algorithm for computing MFP solution
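The pre-processing step ⟦I0 . . . In⟧ := ⟦In⟧ ◦ · · · ◦ ⟦I0⟧ used by this algorithm is plain function composition; a minimal generic sketch (the use of java.util.function is a convenience of this sketch, not the paper's code):

import java.util.List;
import java.util.function.UnaryOperator;

// Sketch: extending the abstract semantics from instructions to basic blocks by
// composition, block = semantics of In after ... after I0. A is the abstract value type.
final class BlockSemantics {
    static <A> UnaryOperator<A> compose(List<UnaryOperator<A>> instrSemantics) {
        return a -> {
            A result = a;
            for (UnaryOperator<A> f : instrSemantics) // apply I0's semantics first, In's last
                result = f.apply(result);
            return result;
        };
    }
}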
To keep the example brief, we omitted all cells which did not change w.r.t. the previous line and we have chosen the best selection of nodes. The resulting MFP solution is identical to the one in Fig. 2, of course. In an implementation, the non–deterministic structure of the working set must be implemented in a deterministic way. However, both the classical algorithm described above and the new algorithm, which we describe in the next section, are based on the concept of working sets. Therefore, we continue to assume that the working set is non–deterministic.
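One common deterministic realization, and the one used for the experiments in Section 5, is a stack. Below is a minimal sketch of a stack-based working set that also avoids duplicate entries; the class name and the duplicate check are illustrative choices.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Sketch: deterministic working set as a LIFO stack that never holds duplicates.
final class WorkStack {
    private final Deque<Integer> stack = new ArrayDeque<>();
    private final Set<Integer> members = new HashSet<>();

    boolean isEmpty() { return stack.isEmpty(); }

    void add(int n) {               // W := W + n
        if (members.add(n)) stack.push(n);
    }

    int choose() {                  // choose n ∈ W and remove it
        int n = stack.pop();
        members.remove(n);
        return n;
    }

    void remove(int n) {            // W := W − n (an element may leave W mid-path)
        if (members.remove(n)) stack.remove(Integer.valueOf(n));
    }
}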
4 New Execution Based Algorithm
The new algorithm for computing the MFP solution (see Fig. 5) of a given data–flow problem is graph–free. The underlying idea is to give the program a more active role: The program itself executes on the abstract values. The program counter variable pc always holds the currently executing instruction. The execution of this instruction affects the abstract values for all succeeding instructions and it is propagated iff it makes a change. Here we see another difference w.r.t. the classical algorithm: While the pc in our algorithm identifies the instruction causing a change, the current node n in the classical algorithm identifies the point where a change is accumulated.
Table 1. Example Execution of classical iterative algorithm (unchanged cells are omitted)

 W                   B    changed entry of a[.]
 {B1, B2, B3, B4}    B0   a[B0] = x/⊥ y/⊥ z/⊥ r/⊥  (a[B1] . . . a[B4] = x/⊤ y/⊤ z/⊤ r/⊤)
 {B1, B2, B3}        B4   a[B4] = x/1 y/2 z/3 r/⊥
 {B2, B3}            B1   a[B1] = x/1 y/2 z/3 r/5
 {B3}                B2   a[B2] = x/1 y/2 z/3 r/5
 {B4}                B3   a[B3] = x/2 y/2 z/3 r/5
 {B1}                B4   a[B4] = x/⊥ y/2 z/3 r/⊥
 {B2}                B1   a[B1] = x/⊥ y/2 z/3 r/5
 {B3}                B2   a[B2] = x/⊥ y/2 z/3 r/5
 ∅                   B3   a[B3] = x/⊥ y/2 z/3 r/5
Note that the algorithm checks whether or not the instruction makes a change by the condition new < spc′, which is equivalent to new ⊓ spc′ = new and new ≠ spc′. Obviously, the execution cannot be deterministic: On the level of abstract values there is no way to determine which branch to follow at conditional jumps. Therefore, we consider both branches here. Consequently, we use a working set of program counters, just like the classical algorithm uses a working set of graph nodes. However, the new algorithm uses the working set in a more modest way than the classical one: While the classical one chooses a new node from the working set in each iteration, the new one follows one path of computation as long as changes occur and the path does not reach the end of the program. This is done in the inner repeat/until loop. Only if this path terminates are elements chosen from the working set in the outer while loop. In addition, the new algorithm tries to keep the working set as small as possible during execution of a path: Note that the instruction W := W − pc is placed inside the inner loop. Hence, even execution of a path may cause the working set to shrink. In comparison to the classical algorithm, our approach has the following advantages:
– It uses less memory: There is neither a graph to store the possible control flows in the program nor an associative array needed to store the abstract values at the basic block level.
Input: Data–flow problem (P, L, ⟦·⟧, a0) where P = I0 . . . In, L = ⟨A, ⊓, ⊔⟩
Output: MFP solution s0, . . . , sn ∈ A

s0 := a0
for i := 1 to n do si := ⊤
W := {0, . . . , n}
while W ≠ ∅ do
    choose pc ∈ W
    repeat
        W := W − pc
        new := ⟦Ipc⟧(spc)
        if Ipc = (goto l) then
            pc′ := l
        else
            pc′ := pc + 1
            if Ipc = (if ψ goto l) and new < sl then
                W := W + l
                sl := new
            end
        end
        if new < spc′ then
            spc′ := new
            pc := pc′
        else
            pc := n + 1
        end
    until pc = n + 1
end
Fig. 5. New execution algorithm for computing MFP solution
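For concreteness, here is a direct Java transcription of the pseudocode in Fig. 5. The Instr encoding, the strict-order test less, and the reuse of the WorkStack sketch above are assumptions of this sketch; the guard for pc′ = n + 1 makes the pseudocode's implicit end-of-program case explicit.

import java.util.List;
import java.util.function.UnaryOperator;

// Sketch: transcription of the execution algorithm of Fig. 5.
final class ExecSolver {
    interface Order<A> { boolean less(A x, A y); } // strict order x < y on A

    record Instr<A>(String kind, int target, UnaryOperator<A> sem) {}

    static <A> A[] solve(List<Instr<A>> p, A a0, A top, Order<A> ord) {
        int n = p.size() - 1;
        @SuppressWarnings("unchecked")
        A[] s = (A[]) new Object[n + 1];
        s[0] = a0;
        for (int i = 1; i <= n; i++) s[i] = top;
        WorkStack w = new WorkStack();          // from the earlier sketch
        for (int i = 0; i <= n; i++) w.add(i);
        while (!w.isEmpty()) {
            int pc = w.choose();
            do {
                w.remove(pc);                   // W := W − pc (no-op if absent)
                Instr<A> ins = p.get(pc);
                A nu = ins.sem().apply(s[pc]);  // new := semantics of I_pc applied to s_pc
                int pcNext;
                if (ins.kind().equals("goto")) {
                    pcNext = ins.target();
                } else {
                    pcNext = pc + 1;
                    if (ins.kind().equals("if") && ord.less(nu, s[ins.target()])) {
                        w.add(ins.target());
                        s[ins.target()] = nu;
                    }
                }
                // a path ends at the program end or when no change occurs
                if (pcNext <= n && ord.less(nu, s[pcNext])) {
                    s[pcNext] = nu;
                    pc = pcNext;
                } else {
                    pc = n + 1;
                }
            } while (pc != n + 1);
        }
        return s;
    }
}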
– The data locality is better. At a node, the classical algorithm visits all predecessors and potentially all successors. Since these nodes will typically be scattered in memory, the access to the abstract values associated with them will often cause data cache misses. In contrast, our algorithm only visits a node and potentially its successors. Typically, one of the successors is the next instruction. Since the abstract values are arranged in an array, the abstract value associated with the next instruction is the next element in the array. Here, the likelihood of cache hits is large. Recent studies show that such small differences in data layout can cause large differences in performance on modern system architectures [CHL99, CDL99].
– There is no need for pre–processing by finding the abstract semantics of a basic block ⟦I0 . . . In⟧ := ⟦In⟧ ◦ · · · ◦ ⟦I0⟧.
– There is no need for a post–processing stage, which propagates the solution from the basic block level to the instruction level.
Theorem 1 (Termination). The algorithm in Fig. 5 terminates for all inputs.
Proof. During each execution of the inner loop at least one 0 ≤ i ≤ n exists such that the value of the variable si decreases w.r.t. the underlying partial order of the lattice L. Since L only has finite chains, this can happen only finitely many times. Hence, the inner loop always terminates.
Furthermore, the working set grows iff a conditional jump is encountered and the corresponding value sl decreases. Just like above, this can happen only finitely many times. Hence, there is an upper bound for the size of the working set. In addition, during each execution of the outer loop, the working set shrinks at least by one element, the one chosen in the outer loop. Hence, the outer loop always terminates.
Theorem 2 (Correctness). After termination of the algorithm in Fig. 5, the values of the variables s0, . . . , sn are the MFP solution of the given data–flow problem.
Proof. To prove correctness, we can obviously consider a modified version of the algorithm, where the inner loop is removed and nodes are selected from the working set in each iteration. In this setting, no program point will be ignored forever. Hence, we can use the results from [GKL+94]: The selection of program points is a fair strategy and the correctness of our algorithm directly follows from the theorem on chaotic fixed point iterations. To do so, we have to validate one more premise of the theorem: We have to show that the algorithm computes si = ⊓j∈predP(i) ⟦Ij⟧(sj) for each program point 0 ≤ i ≤ n. The algorithm can change si iff it visits a program point pc with pc ∈ predP(i). Let s be the value of si before the loop and s′ be the value after the loop. If we can show that s′ = s ⊓ ⟦Ipc⟧(spc), we know that the algorithm computes the meet over all predecessors by iteratively computing the pairwise meet. To show that, we distinguish two cases:
1. If ⟦Ipc⟧(spc) = new < s then s′ = new = s ⊓ ⟦Ipc⟧(spc).
2. Otherwise, we know that ⟦Ipc⟧(spc) = new ≥ s since ⟦·⟧ is monotone and the initial value of s is the top element. Hence we also have s′ = s = s ⊓ ⟦Ipc⟧(spc).
Example 4 (Constant Folding Propagation, Cont’d). Table 2 shows a trace of the execution of the new algorithm for the constant folding propagation example. Each line shows the state of the working set and the approximations at the end of the inner loop, and the values of the program counter pc at the beginning and the end of the inner loop (written in the column pcs in the form begin/end). During this execution, the algorithm loads the value of pc only three times from the working set: Once at the beginning and twice after reaching the end of the program (pcs = 8/9).
The adaption of the execution algorithm for the other three cases of the taxonomy of data–flow problems described at the end of Section 2 is straightforward: (a) Existential problems can simply be handled by replacing < by >, and (b) backward problems require a simple pre–processing which inserts new pseudo instructions to connect jump targets with the corresponding jump instructions.
5 Experimental Results
To validate the benefits of our approach, we studied the performance of the new algorithm in competition with the classical one, both in terms of memory usage and runtimes.
Table 2. Example execution of new algorithm

 W               pcs
 {1, . . . , 8}  0/1
 {2, . . . , 8}  1/2
 {3, . . . , 8}  2/3
 {4, . . . , 8}  3/8
 {4, . . . , 7}  8/9
 {5, . . . , 7}  4/5
 {6, 7}          5/6
 {7}             6/7
 ∅               7/8
 {4}             8/9
 ∅               4/5
 {7}             5/6
 {7}             6/7
 ∅               7/8
 ∅               8/9

[The s0–s8 columns of this trace were flattened in extraction; the changed approximations recorded along the trace are, in order: x/1 y/2 z/⊥ r/⊥ and x/1 y/2 z/3 r/⊥ on the first pass, then x/1 y/2 z/3 r/5 for several entries, x/2 y/2 z/3 r/5, and finally x/⊥ y/2 z/3 r/⊥ and x/⊥ y/2 z/3 r/5 on the second pass.]
Prior to the presentation of the results, we discuss the experimental setting in more detail. We have implemented the classical BB algorithm and our new execution algorithm for full Java Virtual Machine (JVM) code [LY97]. This decision was taken in view of the following reasons:
1. As already mentioned, we see the JVM as a natural target environment for our execution–based algorithm, since it already contains an execution environment and is sensitive to high memory overhead.
2. Except for native code compilers for Java [GJS96], all compilers generate the same JVM code as target code. Consequently, we get realistic samples independent of a specific compiler.
3. Java programs are distributed as JVM code, often available for free on the internet.
Although we omitted procedure/method calls from our model, we can handle full JVM code. For intraprocedural analysis, we assume the result of method invocations to be the top element of the lattice. All these aspects allowed us to collect a large repository of JVM code with little effort. In addition to active search, we established a web site for donations of class files at http://www-i2.informatik.rwth-aachen.de/~mohnen/CLASSDONATE/. So far, we have collected 15,339 classes with a total of 98,947 methods. This large set of samples covers a wide range of applications, applets, and APIs. To name a few, it contains the complete JDK runtime environment (including AWT and Swing), the compiler generator ANTLR, the Byte Code Engineering Library, and the knowledge-based system Protégé. The classes were compiled by a variety of compilers: javac (Sun) in different versions, jikes (IBM), CodeWarrior (Metrowerks), and JBuilder (Borland). In some cases, the class files were compiled to JVM code from languages other than Java, for instance from Ada using Jgnat. In contrast to a hand–selected suite of benchmarks like SPECjvm98 [SPE], we do not impose any restrictions on the samples in the set: The samples may contain errors or even might not work at all. In our opinion, this allows a better estimation of the “average case” a data–flow analyser must face in practice. Altogether, we consider our experiments suitable for estimating the benefits and drawbacks of our method.
import de.fub.bytecode.generic.*;
import Domains.*;

public interface JVMAbstraction {
    public Lattice getLattice();
    public Element getInitialValue(InstructionHandle ih);
    public Function getAbstract(InstructionHandle ih);
}
Fig. 6. Interface JVMAbstraction

However, we did not integrate our experiment in a JVM. Doing so would have fixed the experiment to a specific architecture since the JVM implementation depends on it. Therefore, we implemented the classical BB algorithm and our new execution algorithm in Java, using the Byte Code Engineering Library [BD98] for accessing JVM class files. The implementation directly follows the notions defined in Section 2: We used the interface concept of Java to model the concepts of lattices, (JVM) abstractions, and data–flow problems. For instance, Fig. 6 shows the essential parts of the interface JVMAbstraction which models JVM abstractions. Consequently, the algorithms do not depend on specific data–flow problems. In contrast, our approach allows any data–flow problem to be modelled simply by providing a Java class which implements the interface JVMAbstraction.
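As an illustration of this plug-in structure, here is a skeleton of how constant folding propagation could be supplied. Lattice, Element, and Function come from the paper's own Domains package, whose constructors are not shown in the excerpt, so the bodies below are placeholders rather than working code.

import de.fub.bytecode.generic.*;
import Domains.*;

// Skeleton of a data-flow problem plugged in via JVMAbstraction. How to build
// the Domains objects is not shown in the excerpt; the bodies are placeholders.
public class ConstantFoldingAbstraction implements JVMAbstraction {
    // Placeholder: a lattice of maps from variables to Z ∪ {top, bottom}.
    private final Lattice lattice = null;

    public Lattice getLattice() { return lattice; }

    public Element getInitialValue(InstructionHandle ih) {
        // Entry annotation a0: no variable is constant (bottom for every variable).
        return null; // placeholder
    }

    public Function getAbstract(InstructionHandle ih) {
        // Dispatch on the JVM instruction at ih: identity for jumps,
        // an updating transfer function for arithmetic and store instructions.
        return null; // placeholder
    }
}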
Fig. 7. Histogram of memory reduction [x-axis: memory reduction in %, ticks at 20, 40, 60, 80, 100]

For the experiment, we implemented constant folding propagation, as described in the previous sections. All experiments were done on a system with a Pentium III at 750 MHz and 256 MB main memory, running under Linux 2.2.16 and Sun JDK 1.2.2. For each of the 98,947 JVM methods of the repository, we measured memory usage and runtimes of both our algorithm and the classical algorithm. The working set was implemented as a stack.
Memory improvement. Given the number of bytes mX allocated by our algorithm and the number of bytes mC allocated by the classical algorithm, we compute the memory reduction as the percentage mX/mC ∗ 100. In the resulting distribution, we found a maximal reduction of 7.28%, a minimal reduction of 74.61%, and an average reduction of 30.83%. Moreover, the median¹ is 31.28%, which is very close to the average. Hence, our algorithm always reduces the amount of memory, in the best case to less than a tenth! In the average case, the reduction is to about a third. Fig. 7 shows a histogram of the distribution.
A study of the relation of number of instructions and memory reduction does not reveal a relation between those values. In Fig. 8(a) each point represents a method: The coordinates are the number of instructions on the horizontal axis and the memory reduction on the vertical axis. We have restricted the plot to the interesting range up to 1,000 instructions: While the sample set contains methods with up to 32,768 instructions, the average number of instructions per method is only 40.3546 and the median is only 11. Obviously, object–orientation has a measurable impact on the structure of programs. Surprisingly, there is a relation between the amount of reduction caused by BBs and memory reduction. One might expect that the classical algorithm is better for higher amounts of reduction caused by BBs. However, this turns out to
¹ The median (or central value) of a distribution is the value with the property that one half of the elements of the distribution is less or equal and the other half is greater or equal.
Fig. 8. Memory reduction of new algorithm: (a) memory reduction vs. number of instructions; (b) memory reduction vs. basic block reduction
be wrong: Fig. 8(b) shows that the new algorithm reduces the memory even more for higher BB reductions.
Runtimes. For the study of runtimes, we use the speedup caused by the use of the classical algorithm: If tC is the runtime of the classical algorithm and tX is the runtime of our algorithm, we consider tC/tX to be the speedup. The distribution of speedups turned out to be a big surprise: Speedups vary from 291.2 down to 0.015, but the mean is 1.62, the median is 1.33, and the variance is only 7.49! Hence, for the majority of methods our algorithm performs as well as the BB algorithm. Fig. 9 shows a histogram of the interesting area of the distribution. Again, relating speedup on one hand and number of instructions (Fig. 10(a)) on the other hand did not reveal a significant correlation. In addition, and not surprisingly, the speedup is higher for better BB reduction (Fig. 10(b)).
Fig. 9. Histogram of BB speedup [x-axis: speedup, ticks at 0.5 through 3.0]
Fig. 10. Speedup of classical algorithm: (a) speedup vs. number of instructions; (b) speedup vs. basic block reduction
6 Conclusions and Future Work
We have shown that data–flow analysis can be done without an explicit graph structure. Our new algorithm for computing the MFP solution of a data–flow problem is based on the idea of the program executing on the abstract values. The advantages resulting from the approach are less memory use, better data locality, and no need for pre–processing or post–processing stages. We validated these expectations by applying a test implementation to a large set of samples. It turned out that while the runtimes are almost identical, our approach always saves between a third and 9/10 of the memory used by the classical algorithm. In the average case, it saves two thirds of the memory used by the classical algorithm. The algorithm is very easy to implement in settings where there is already machinery for the execution of programs available, for instance in Java Virtual Machines. In addition, the absence of the graph makes the algorithm easier to implement. In the presence of full JVM code, implementing BB graphs turned out to be trickier than expected. In fact, after having implemented both approaches, errors in the implementation of the BB graphs were revealed by the correct results of the new algorithm.
References

[AH87] S. Abramsky and C. Hankin. An Introduction to Abstract Interpretation. In S. Abramsky and C. Hankin, editors, Abstract Interpretation of Declarative Languages, chapter 1, pages 63–102. Ellis Horwood, 1987.
[ASU86] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison Wesley, 1986.
[BD98] B. Bokowski and M. Dahm. Byte Code Engineering. In C. H. Cap, editor, Java-Informations-Tage (JIT), Informatik Aktuell. Springer-Verlag, 1998. See also http://bcel.sourceforge.net/.
[Bla98] B. Blanchet. Escape Analysis: Correctness Proof, Implementation and Experimental Results. In Proceedings of the 25th Symposium on Principles of Programming Languages (POPL). ACM, January 1998.
[Bla99] B. Blanchet. Escape Analysis for Object Oriented Languages: Application to Java. In Proceedings of the 14th Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), volume 34, 10 of ACM SIGPLAN Notices, pages 20–34. ACM, 1999.
[CC77] P. Cousot and R. Cousot. Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixed Points. In Proceedings of the 4th Symposium on Principles of Programming Languages (POPL), pages 238–252. ACM, January 1977.
[CDL99] T. M. Chilimbi, B. Davidson, and J. R. Larus. Cache-Conscious Structure Definition. In PLDI'99 [PLD99], pages 13–24.
[CHL99] T. M. Chilimbi, M. D. Hill, and J. R. Larus. Cache-Conscious Structure Layout. In PLDI'99 [PLD99], pages 1–12.
[Deu97] A. Deutsch. On the Complexity of Escape Analysis. In Proceedings of the 24th Symposium on Principles of Programming Languages (POPL), pages 358–371. ACM, January 1997.
[GJS96] J. Gosling, B. Joy, and G. Steele. The Java Language Specification. The Java Series. Addison Wesley, 1996.
[GKL+94] A. Geser, J. Knoop, G. Lüttgen, O. Rüthing, and B. Steffen. Chaotic Fixed Point Iterations. Technical Report MIP-9403, Fakultät für Mathematik und Informatik, University of Passau, 1994.
[KKS98] J. Knoop, D. Koschützki, and B. Steffen. Basic-Block Graphs: Living Dinosaurs? In K. Koskimies, editor, Proceedings of the 7th International Conference on Compiler Construction (CC), number 1383 in Lecture Notes in Computer Science, pages 65–79. Springer-Verlag, 1998.
[LY97] T. Lindholm and F. Yellin. The Java Virtual Machine Specification. The Java Series. Addison Wesley, 1997.
[MJ81] S. S. Muchnick and N. D. Jones. Program Flow Analysis: Theory and Applications. Prentice-Hall, 1981.
[Moh97] M. Mohnen. Optimising the Memory Management of Higher-Order Functional Programs. Technical Report AIB-97-13, RWTH Aachen, 1997. PhD Thesis.
[Muc97] S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, 1997.
[PLD99] Proceedings of the ACM SIGPLAN '99 Conference on Programming Language Design and Implementation (PLDI), SIGPLAN Notices 34(5). ACM, 1999.
[SPE] Standard Performance Evaluation Corporation. SPECjvm98 Documentation, Release 1.01. Online version at http://www.spec.org/osg/jvm98/jvm98/doc/.
A Representation for Bit Section Based Analysis and Optimization

Rajiv Gupta¹, Eduard Mehofer², and Youtao Zhang¹
¹ Department of Computer Science, The University of Arizona, Tucson, Arizona
² Institute for Software Science, University of Vienna, Vienna, Austria
Abstract. Programs manipulating data at subword level are growing in number and importance. Examples are programs running on network processors, media processors, or general purpose processors with media extensions. In addition, data compression techniques, which are vital for embedded system applications, result in code operating at subword level as well. Performing analysis at word level, however, is too coarse grained and misses opportunities for optimizations. In this paper we introduce a novel program representation which allows reasoning at subword level. This is achieved by making accesses to subwords explicit. First, in a local phase, statements are analyzed and accesses at subword level identified. Then, in a global phase, the control-flow is taken into account and the accesses are related to one another. As a result, various traditional analyses can be performed on our representation at subword level very easily. We discuss the algorithms for constructing the program representation in detail and illustrate their application with examples.
1 Introduction
Programs that manipulate data at subword level are growing in number and importance. The need to operate upon subword data arises if multiple data items are packed together into a single word of memory. The packing may be a characteristic of the application domain or it may be carried out automatically by the compiler. We have identified the following categories of applications. Network processors are specialized processors that are being designed to efficiently manipulate packets [5]. Since a packet is a stream of bits the individual fields in the packet get mapped to subword entities within a memory location or may even be spread across multiple locations. Media processors are special purpose processors to process media data (e.g., TigerSHARC [3]) as well as general purpose processors with multimedia extensions (e.g., Intel’s MMX [1,6]). The narrow width of media data is exploited by
Supported by DARPA PAC/C Award F29601-00-1-0183 and NSF grants CCR-0105355, CCR-0096122, EIA-9806525, and EIA-0080123 to the Univ. of Arizona.
packing multiple data items in a single word and supporting instructions that are able to exploit subword parallelism.
Data compression transformations reduce the data memory footprint of the program [2,9]. After data compression transformations have been applied, the resulting code operates on subword entities.
Program analysis, which is the basis of the optimization and code generation phases, is a challenging task for the above programs since we need to reason about entities at subword level. Moreover, accesses at subword level are expressed in C (the commonly used language in those application domains) by means of rather complex mask and shift operations. In this paper we introduce a novel program representation that enables reasoning about subword entities corresponding to bit sections (a bit section is a sequence of consecutive bits within a word). This is made possible by explicitly expressing manipulation of bit sections and relating the flow of values among bit sections. We present algorithms for constructing this representation. The key steps in building our representation are as follows:
– By locally examining the bit operations in an expression appearing on the right hand side of an assignment, we identify the bit sections of interest. In particular, the word corresponding to the variable on the left hand side is split into a number of bit sections such that adjacent bit sections are modified differently by the assignment. The assignment statement is replaced by multiple bit section assignments.
– By carrying out global analysis, explicit relationships are established among different bit sections belonging to the same variable. These relationships are expressed by introducing split and combine nodes. A split node takes a larger bit section and replaces it by multiple smaller bit sections and a combine node takes multiple adjacent bit sections and replaces them by a single larger bit section.
The above representation is appropriate for reasoning about bit sections. For example, the flow of values among the bit sections can be easily traced in this representation, resulting in definition-use chains at the bit section level. Moreover, since our representation makes accesses at subword level explicit, processors with special instructions for packet-level addressing can be supported easily and efficiently by the code generator, and the costly mask and shift operations can be replaced. The remainder of the paper is organized as follows. In section 2 we describe our representation including its form and its important properties. In sections 3 and 4 we present the local and global phases of the algorithm used to construct the representation. And finally concluding remarks are given in section 5.
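As a small illustration of the mask-and-shift idiom that such programs use for subword accesses, consider the expression analyzed in Fig. 1 of the next section; the bit operators read identically in C and Java, and the section helper below is a hypothetical utility for the vl..h notation introduced there.

// Mask-and-shift idiom for subword access (same expression as Fig. 1 below):
// keep bits 9..16 of a, insert the low byte of b shifted into bits 5..12.
final class SubwordDemo {
    static int pack(int a, int b) {
        return (a & 0xff00) | ((b & 0xff) << 4);
    }

    // Hypothetical helper: extract bit section v_{l..h} (1-based, inclusive),
    // making the implicit subword access explicit.
    static int section(int v, int l, int h) {
        int width = h - l + 1;
        int mask = (width == 32) ? -1 : ((1 << width) - 1);
        return (v >>> (l - 1)) & mask;
    }

    public static void main(String[] args) {
        int a = 0xffffffff, b = 0xAB;
        int packed = pack(a, b);
        // bits 5..8 of the result equal bits 1..4 of b:
        System.out.println(section(packed, 5, 8) == section(b, 1, 4)); // true
    }
}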
2 The Representation
This section presents our representation for bit section based analyses and optimizations. The starting point for our extensions are programs modeled as directed control flow graphs (CFG) G = (N, E, entry, exit) with node set N, including the unique entry and exit nodes, and edge set E. For ease of presentation we assume that the nodes represent statements rather than basic blocks¹. The construction of our representation is driven by assignment statements of the form v = t whereby the right hand side term t contains bit operations only, i.e. & (and), | (or), not (not), << (shift left), and >> (shift right). Since the term on the right hand side of such an assignment can be arbitrarily long and intricate, and since our goal is to replace those assignments by a sequence of simplified assignments, we call them complex assignments.
Essentially our representation is based on two transformations performed on the CFG. First we partition the original program variables into bit sections of interest. The bit sections of interest are identified locally by examining the usage of these bit sections in a complex assignment. Only complex assignments which are formed using bit operations are processed by this phase because partitioning is guided by the special semantics of bit operations. Other assignments are not partitioned since no useful additional information can be exposed in this way. Hence, in the remainder of the discussion, only complex assignments are considered. Second we relate definitions and uses of bit sections belonging to the same program variable using global analysis. The required program representation is obtained by making the outcomes of the above steps explicit in the CFG. In the remainder of this section, we illustrate the effects of the above two steps and describe the resulting representation in detail.
2.1 Identifying Bit Sections of Interest
Definition 1 (Bit Section). Given a program variable v with a size of c bits, a bit section of v is denoted by vl..h (1 ≤ l ≤ h ≤ c) and refers to the sequence of bits l, l + 1, ..., h − 1, h; the definition includes 1-bit sections as well as whole variable sections. The symbol := is used to denote a bit section assignment. In the following discussion, if nothing is said to the contrary, we assume for ease of discussion that variables have a size of 32 bits.
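To make Definition 1 concrete, the value of a bit section vl..h can be read out of a 32-bit word with a shift and a mask. The following Java helper is our illustration only, not part of the representation:

class BitSections {
    // Extract bits l..h (1-based from the least significant bit) of v.
    static int bitSection(int v, int l, int h) {
        int width = h - l + 1;
        int mask = (width == 32) ? -1 : ((1 << width) - 1);  // guard the full-word case
        return (v >>> (l - 1)) & mask;
    }
}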
Partitioning a program variable. Given a complex assignment (v = t), the program variable v on the left hand side is partitioned into bit sections if each of the resulting sections is updated differently from its neighboring bit sections by the term t on the right hand side of the complex assignment. In particular, the value of a bit section of the lhs variable v, say vl..h, can be specified in one of the following ways:
– No Modification: The value of vl..h remains unchanged because it is assigned its own value.
– Constant Assignment: vl..h is assigned a compile time constant.
– Copy Assignment: The value of another bit section variable is copied into vl..h.
– Expression Assignment: The value of vl..h is determined by an expression which is in general simpler than t.

The partitioning of variable v is made explicit in the program representation by replacing the complex assignment by a series of bit section assignments. A consequence of this transformation is that operands used in t may also have to be partitioned into compatible bit sections.

Properties. There are two important properties that are observed by our choice of bit section partitions:
1. Non-overlapping sections. The sections resulting from such partitioning are non-overlapping for individual assignments.
2. Maximal sections. Each section is as large as needed to expose the semantic information that can be extracted from a given complex assignment. In other words, further partitioning will not provide us with any more information about the values stored in the individual bits.

Example. Consider the complex assignment to variable a shown in Fig. 1. If we carefully examine this assignment, we observe that it is equivalent to the bit section assignments shown below. Note that each bit section is updated differently from its neighboring sections. Bit sections a1..4 and a17..32 are set to 0, a5..8 is involved in a copy assignment, and a13..16 is not modified at all (we have placed the assignment in the figure simply for clarity). Bit section a9..12 is computed using an expression which is simpler than the original expression. Finally, as a consequence of a's partitioning, variable b must be partitioned into compatible bit sections as well.
Complex Assignment
  a = (a & 0xff00) | ((b & 0xff) << 4)

Bit Section Assignments
  a1..4   := 0                /* constant assignment */
  a5..8   := b1..4            /* copy assignment */
  a9..12  := b5..8 | a9..12   /* (simpler) expression */
  a13..16 := a13..16          /* no modification */
  a17..32 := 0                /* constant assignment */
Fig. 1. Bit Section Assignments
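As a quick sanity check of the decomposition (our illustration, not from the paper), the complex assignment and its bit section assignments can be transcribed into Java and compared; viaBitSections computes the same result as complexAssign for all inputs:

class Fig1Check {
    static int complexAssign(int a, int b) {
        return (a & 0xff00) | ((b & 0xff) << 4);
    }
    static int viaBitSections(int a, int b) {
        int r = 0;                                          // a1..4 := 0, a17..32 := 0
        r |= (b & 0xf) << 4;                                // a5..8  := b1..4
        r |= (((b >>> 4) & 0xf) | ((a >>> 8) & 0xf)) << 8;  // a9..12 := b5..8 | a9..12
        r |= a & 0xf000;                                    // a13..16 := a13..16 (unchanged)
        return r;                                           // equals complexAssign(a, b)
    }
}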
Note that the bit section assignments reveal the information that some bit sections are set to constant values, that others are assigned copies of bit sections, and that some are unchanged. All this information would be lost if we reasoned about the variable a as a single 32 bit entity.
2.2 Establishing Relationships among Bit Sections
After introducing bit sections for single complex assignments, the goal of this step is to establish relationships among the bit sections arising from different complex assignments. To illustrate the need for relating bit sections, let us consider the computation of definition-use relationships as shown in Fig. 2. The computation of definition-use relationships is complicated by the fact that left or right hand side occurrences of a given program variable may be partitioned differently by different complex assignments. In Fig. 2a partition v17..32 and partition v4..16 are created for variable v in the then-branch and else-branch, respectively. The conditional is followed by a use of partition v1..16. However, the partitions do not match each other and the relationships among the bit sections are hidden. Hence, we introduce special nodes into the program which create and destroy bit sections and make the relationships between bit sections explicit: a split node is introduced to create a set of smaller non-overlapping bit sections from a larger bit section, and a combine node (denoted by ⊕) is introduced to coalesce smaller adjacent bit sections into a single larger bit section.

The introduction of split and combine nodes as shown in Fig. 2b ensures that each section name exists before it is referenced. The edges show the flow of values to the use of v1..16 at the end of the code fragment. By traversing these edges we can easily determine that if the then-branch is executed, the value reaching the use of v1..16 at line 7 is defined by the assignment at line 1. On the other hand, if the else-branch is executed, the values of bits 1 through 3 defined at line 1 and the values of bits 4 to 16 defined at line 5 reach the use of v1..16 at line 7. Thus an important consequence of appropriately introducing split and combine nodes is that definition-use relationships can now be established among bit sections.

Definition 2 (Split and Combine Nodes). A split node that partitions a bit section vl..h into n non-overlapping bit sections vl..s1, vs1+1..s2, ..., vsn−1+1..h is written as:

  (vl..s1, vs1+1..s2, ..., vsn−1+1..h) := vl..h

Conversely, a combine node that merges adjacent bit sections vl..s1, vs1+1..s2, ..., vsn−1+1..h into a single larger contiguous bit section vl..h is written as:

  vl..h := ⊕(vl..s1, vs1+1..s2, ..., vsn−1+1..h)

Properties. The rules for inserting split and combine nodes are derived from the following properties, which shall hold for our representation.
1. Non-overlapping sections. While at different program points a variable may be partitioned differently, at each program point we associate with a program variable a unique partitioning into non-overlapping sections.
(1) v1..32 := ...
(2) if (cond) then
(3)   v17..32 := ...
(4) else
(5)   v4..16 := ...
(6) endif
(7) ... := v1..16
(a) Bit section assignments

[(b) Def-use chains: the CFG of (a) augmented with a split node (v1..16, v17..32) := v1..32 before the definition in the then-branch, a split node (v1..3, v4..16, v17..32) := v1..32 before the definition in the else-branch, a combine node v1..16 := ⊕(v1..3, v4..16) after it, and edges tracing the flow of values to the use of v1..16.]

Fig. 2. Using Split and Combine Nodes

2. Create-before-use. Along all program paths, each right-hand side appearance of a bit section must be preceded by a left-hand side appearance of the same section.

The split and combine nodes act as transition points where the non-overlapping partitioning of a variable is changed from one partitioning to another. Given a program point where a split or combine node is placed, the node makes explicit the relationship among the bit sections that exist immediately preceding and immediately following that program point.

Minimal representation. The properties described above, which are essential for our representation, can be realized by different placements of split and combine nodes. Of course, our goal is to meet the required properties with a minimal number of split and combine node insertions, which leads to the following two minimality criteria:
1. The number of split and combine nodes introduced along any path shall be minimal. This can be achieved by ensuring that if a bit section is used repeatedly along a path, it is created once before its first reference and destroyed only after its last reference.
2. The lifetime of a bit section, i.e. the period during which the section name exists, shall be minimal (under the above criterion). This can be achieved by creating a section at the latest program point where it is needed and destroying it at the earliest program point where it is no longer needed.

The lifetime of a bit section starts and ends at a split or a combine node. More specifically, the appearance of a bit section on the left hand side of a split or combine node represents the start point of the bit section's lifetime, and its appearance on the right hand side represents the end point. The two minimality criteria mentioned above imply that the lifetimes of the bit sections are chosen such that they are long enough to reduce the need for
split and combine nodes but not any longer. These criteria result in the following placement strategy for split and combine nodes.

Earliest point placement of combine nodes. A combine node that destroys a bit section name is inserted at the earliest program point where the bit section is not (partially) anticipable, that is, where it is known that an appearance of the bit section will no longer be encountered and therefore the bit section is no longer needed.

Latest point placement of split nodes. A split node that creates a bit section name is inserted at the latest program point at which the bit section is live but not (partially) available, that is, where it does not already exist.

Example. Consider the code fragment of Fig. 3 together with the corresponding CFG with the inserted split and combine nodes displayed in dark boxes. Split nodes are placed at the latest program points immediately preceding the references to v1..16, v17..32, and v24..32 at nodes 1, 2, and 5 in order to create those bit sections, since none of them is partially available. On the other hand, combine nodes are placed at the earliest program points just after the references to v17..32 and v24..32 at nodes 3 and 6 in order to destroy those bit sections, since neither is used any more. Finally, since bit sections v1..16 and v17..32 are referenced in the then-branch of the second conditional statement but not in the else-branch, the earliest program point to destroy those bit sections is the very first statement of the else-branch at node 4.
v1..32 := ...
if () then
  ... := v1..16
else
  ... := v17..32
endif
if () then
  ... := v1..16
  ... := v17..32
else
  ... := v24..32
endif
... := v1..32
(a) Sample code

[(b) Earliest/latest insertion of combines/splits: the CFG of (a) with split nodes (v1..16, v17..32) := v1..32 inserted at nodes 1 and 2 and (v1..23, v24..32) := v1..32 at node 5, and combine nodes v1..32 := ⊕(v1..16, v17..32) at nodes 3 and 4 and v1..32 := ⊕(v1..23, v24..32) at node 6.]

Fig. 3. Placement of Split and Combine Nodes
Minimal representation is not unique. In some situations multiple solutions are equally good under our criteria. As an example, consider two consecutive if-statements with a use of v1..16 in the then-branch and a use of v1..32 in the else-branch, as shown in Fig. 4a. If we decide to preserve bit section v1..16 at the end of the first if-statement, we get the solution of Fig. 4b. On the other hand, if we decide to preserve bit section v1..32, we get the solution of Fig. 4c. In both solutions three nodes are inserted; however, taking the left branches of the if-statements, we have one inserted node in the first solution and three in the second, whereas taking the right branches we have two inserted nodes in the first solution but none in the second. We consider either choice equally good, since both sections v1..16 and v1..32 have future references at the end of the first if-statement and both solutions result in an equal number of split and combine nodes. Note that our algorithm yields the solution of Fig. 4b.

v1..32 := ...
if () then
  ... := v1..16
else
  ... := v1..32
endif
if () then
  ... := v1..16
else
  ... := v1..32
endif
... := v1..32
(a) Source code

[(b) Preserving v1..16 and (c) preserving v1..32: two alternative placements of the split nodes (v1..16, v17..32) := v1..32 and combine nodes v1..32 := ⊕(v1..16, v17..32), each inserting three nodes in total.]

Fig. 4. Multiple Minimal Solutions
3 Local Phase: Identifying Relevant Bit Sections
In this phase the partitioning of the left hand side (lhs) variables of complex assignments is determined under the constraint that adjacent bit sections shall be computed differently. This is done in two steps: first, a bottom-up traversal of the right hand side (rhs) expression is carried out during which the bit sections required for the lhs variable are determined; second, the bit section assignments are generated in a top-down traversal of the rhs expression tree.

I. Finding bit sections of the lhs variable. The rhs expression tree is traversed in bottom-up order and each node in the expression tree is annotated with the bit sections of the expression's operands that contribute to the computation of bit sections of the intermediate value represented by the node. In our algorithm the intermediate value associated with an expression tree node during evaluation of the expression is denoted by nval. We also use the following basic notations:
– var : {[(l, h), s]}. Bit section value nvall+s+1..h+s is a function of bit section varl+1..h. If s is 0, the bit sections of var and nval refer to the same bit
positions. Otherwise, a non-zero value of s indicates that the bit sections refer to different bit positions, which is achieved by using a left or right shift (<<, >>).
– 0/1 : {[(l, h), s]}. Bit section value nvall+s+1..h+s = 0/1, that is, we have a constant bit section with all bits equal to 0 or 1.

For ease of presentation the following shorthand notations are used as well:
– var : {[(l, m, h), s]} ≡ var : {[(l, m), s], [(m, h), s]}. Shorthand notation for expressing adjacent bit sections of variable var.
– var : {[(l1, h1), s1], [(l2, h2), s2], ...} or 0/1 : {[(l1, h1), s1], [(l2, h2), s2], ...}. Shorthand notation for multiple non-adjacent bit sections.

Finally, we introduce the following operations for value ranges. These operations are used in the computation of bit sections throughout the paper.

(l1, h1) ∩ (l2, h2) = (max(l1, l2), min(h1, h2))   if max(l1, l2) < min(h1, h2)
                    = φ                            otherwise

(l1, h1) ∪ (l2, h2) = {(l1, l2), (l2, h2), (h2, h1)} = (l1, l2, h2, h1)   if l1 < l2 < h2 < h1
                    = {(l1, l2), (l2, h1), (h1, h2)} = (l1, l2, h1, h2)   if l1 < l2 < h1 < h2
                    = {(l2, l1), (l1, h2), (h2, h1)} = (l2, l1, h2, h1)   if l2 < l1 < h2 < h1
                    = {(l2, l1), (l1, h1), (h1, h2)} = (l2, l1, h1, h2)   if l2 < l1 < h1 < h2
                    = {(l1, h1), (l2, h2)}                                otherwise

(l1, h1) − (l2, h2) = {(l1, l2), (h2, h1)}   if l1 < l2 < h2 < h1
                    = (l1, l2)               if l1 < l2 < h1 < h2
                    = (h2, h1)               if l2 < l1 < h2 < h1
                    = φ                      if l2 < l1 < h1 < h2
                    = (l1, h1)               otherwise
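For illustration (a sketch of ours, not code from the paper), the intersection and subtraction operations translate directly into Java; a Range (l, h) denotes bits l + 1 through h, and null stands for φ:

import java.util.List;

record Range(int l, int h) {}

class RangeOps {
    // (l1, h1) ∩ (l2, h2); null plays the role of φ.
    static Range intersect(Range a, Range b) {
        int l = Math.max(a.l(), b.l()), h = Math.min(a.h(), b.h());
        return (l < h) ? new Range(l, h) : null;
    }
    // (l1, h1) − (l2, h2), following the case analysis above.
    static List<Range> subtract(Range a, Range b) {
        if (a.l() < b.l() && b.h() < a.h())                    // l1 < l2 < h2 < h1
            return List.of(new Range(a.l(), b.l()), new Range(b.h(), a.h()));
        if (a.l() < b.l() && b.l() < a.h() && a.h() < b.h())   // l1 < l2 < h1 < h2
            return List.of(new Range(a.l(), b.l()));
        if (b.l() < a.l() && a.l() < b.h() && b.h() < a.h())   // l2 < l1 < h2 < h1
            return List.of(new Range(b.h(), a.h()));
        if (b.l() < a.l() && a.h() < b.h())                    // l2 < l1 < h1 < h2: φ
            return List.of();
        return List.of(a);                                     // disjoint ranges: unchanged
    }
}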
Algorithm. Visit the nodes in the expression tree in bottom-up order, applying steps 1 and 2 to them. Identify the bit sections of the lhs variable in step 3.

1. Compute node annotations exploiting characteristics of operators and operands.
– Variable leaf node. Annotate the node with x : {[(0, 32), 0]}, where variable x is the operand associated with the leaf node (and 32 is the bit width).
– Constant leaf node. Annotate the node with a set of bit sections each of which contains only 0's or 1's, that is, 0 : {[(l01, h01), s01], [(l02, h02), s02], ...} and 1 : {[(l11, h11), s11], [(l12, h12), s12], ...}.
– Bitwise And (&) operator. We exploit the following properties in computing the annotations for the And node: a & 1 = a, a & 0 = 0.

  operand annotations: var : {[(l1, h1), s1]}, 0 : {[(l2, h2), 0]}
  &'s annotation: var : {[(l′ − s1, h′ − s1), s1]}, where (l′, h′) = (l1 + s1, h1 + s1) − (l1 + s1, h1 + s1) ∩ (l2, h2)

  operand annotations: var : {[(l1, h1), s1]}, 1 : {[(l2, h2), 0]}
  &'s annotation: var : {[(l1, h1), s1]}, 1 : {[(l2, h2) − (l1 + s1, h1 + s1) ∩ (l2, h2), 0]}
– Bitwise Or (|) operator. We exploit the following properties in computing the annotations for the Or node: a | 1 = 1, a | 0 = a.

  operand annotations: var : {[(l1, h1), s1]}, 1 : {[(l2, h2), 0]}
  |'s annotation: var : {[(l′ − s1, h′ − s1), s1]}, where (l′, h′) = (l1 + s1, h1 + s1) − (l1 + s1, h1 + s1) ∩ (l2, h2)

  operand annotations: var : {[(l1, h1), s1]}, 0 : {[(l2, h2), 0]}
  |'s annotation: var : {[(l1, h1), s1]}, 0 : {[(l2, h2) − (l1 + s1, h1 + s1) ∩ (l2, h2), 0]}
– Not operation (not nval). Corresponding to the constant bit sections in nval, create constant bit sections for the not node, where 0 bit sections are converted into 1 bit sections and 1 bit sections are converted into 0 bit sections. That is, if 0/1 : {[(l, h), 0], ...} annotates nval, then 1/0 : {[(l, h), 0], ...} annotates not.
– Constant left shift (nval << c, where c is a constant ≤ 32). From a bit section that is an annotation of nval, compute bit sections that annotate the << node as follows.

  nval's annotation: var/1 : {[(l, h), s]}
  <<'s annotation: 0 : {[(0, c), 0]} and var/1 : {[(l′ − s − c, h′ − s − c), s + c]}, where (l′, h′) = (l + s + c, h + s + c) ∩ (0, 32)

  nval's annotation: 0 : {[(l, h), 0]}
  <<'s annotation: 0 : {[(0, c), 0]} and 0 : {[(l′, h′), 0]}, where (l′, h′) = (l + c, h + c) ∩ (0, 32)

  nval's annotation: 0 : {[(0, h), 0]}
  <<'s annotation: 0 : {[(0, h + c) ∩ (0, 32), 0]}
– Constant right shift (nval >> c). From a bit section that is an annotation of nval, compute bit sections that annotate the >> node as follows. The following is applicable for shifting an unsigned value. In the case of a signed value, if the sign is known, similar rules can be derived.

  nval's annotation: var/1 : {[(l, h), s]}
  >>'s annotation: 0 : {[(32 − c, 32), 0]} and var/1 : {[(l′ − s + c, h′ − s + c), s − c]}, where (l′, h′) = (l + s − c, h + s − c) ∩ (0, 32)

  nval's annotation: 0 : {[(l, h), 0]}
  >>'s annotation: 0 : {[(32 − c, 32), 0]} and 0 : {[(l′, h′), 0]}, where (l′, h′) = (l − c, h − c) ∩ (0, 32)

  nval's annotation: 0 : {[(l, 32), 0]}
  >>'s annotation: 0 : {[(l − c, 32) ∩ (0, 32), 0]}
2. Ensure all bits within a section are computed identically. Closer examination of the bit sections of different operand variables that annotate a given node can reveal whether further splitting of these bit sections is required to ensure that each resulting bit section is computed by exactly one expression. The bit section var1 : {[(l1, h1), s1]} is split by bit section var2 : {[(l2, h2), s2]}, denoted by var1/var2, at a node in the expression tree by means of the following rule:
var1 : {[(l1, h1), s1]} / var2 : {[(l2, h2), s2]}
  = var1 : {[(l1, l2 + s2 − s1, h2 + s2 − s1, h1), s1]}   if l1 + s1 < l2 + s2 < h2 + s2 < h1 + s1
  = var1 : {[(l1, l2 + s2 − s1, h1), s1]}                 if l1 + s1 < l2 + s2 < h1 + s1 < h2 + s2
  = var1 : {[(l1, h2 + s2 − s1, h1), s1]}                 if l2 + s2 < l1 + s1 < h2 + s2 < h1 + s1
  = var1 : {[(l1, h1), s1]}                               otherwise
The splitting is performed by considering every ordered pair of bit sections. As we can see, the above bit sectioning is performed to distinguish bit sections which are computed differently. More precisely, we distinguish a bit section which is computed from both var1 and var2 from one which is computed only from var1.

3. Identify bit sections for the lhs variable. After steps 1 and 2, the annotations of the root node of the expression tree are used to identify the bit sections of the variable on the left hand side. Let us assume that the width of a word is 32 bits; then we split the initial bit section of the lhs variable, varlhs : {[(0, 32), 0]}, if parts of it are computed differently. More formally, new bit sections are obtained by repeatedly splitting varlhs by each annotation any : {[(l, h), s]} at the root node of the rhs tree, using the splitting rule of step 2.

II. Generating bit section assignments. In this step we generate the bit section assignments corresponding to the bit sections identified for the lhs variable of a complex assignment. Given a bit section vl+1..h, the expression which has to be assigned to vl+1..h is returned by the function call genexp((l, h), eroot), where eroot is the root node of the entire expression tree; i.e., for each bit section (l, h) of a lhs variable v we call

  vl+1..h := simplify(genexp((l, h), eroot)).

The function simplify is the last step, in which trivial patterns like "a | 0" or "a & 1" are reduced to "a". As shown in Fig. 5, genexp() traverses the expression, examining the bit sections that annotate each node in order to find those that contribute to bits l + 1..h. If only one of the bit sections at a node contributes to bits l + 1..h, no further traversal of the subtree is required. In this case the operand is a sequence of h − l bits belonging to a variable, or it consists of constant (0 or 1) bits. If multiple bit sections contribute to bits l + 1..h, then the operator represented by the current node is included in the expression and the subexpressions that are its operands are identified by recursively applying genexp() to the descendants.
genexp((l, h), e) {
  BS = φ
  for each section any : [(el, eh), es] ∈ set of annotations of node e do
    if range (l, h) is contained in range (el + es, eh + es) then
      BS = BS ∪ {any : [(el, eh), es]}
    endif
  endfor
  if BS == {any : [(el, eh), es]} then        /* exactly one contributing section */
    return ("any(l−es+1)..(h−es)")
  else
    let e.lchild and e.rchild be the expression trees for the operands of e
    case e.op of
      e.op == "not"  : return ("not" genexp((l, h), e.lchild))
      e.op == "<< c" : return (genexp((l − c, h − c), e.lchild))
      e.op == ">> c" : return (genexp((l + c, h + c), e.lchild))
      e.op == "&"    : return (genexp((l, h), e.lchild) "&" genexp((l, h), e.rchild))
      e.op == "|"    : return (genexp((l, h), e.lchild) "|" genexp((l, h), e.rchild))
    end case
  endif
}
Fig. 5. Generating Bit Section Assignments

[Fig. 6 shows the expression tree of the complex assignment a = (a & 0xff00) | ((b & 0xff) << 4) from Fig. 1, annotated by the algorithm. Leaf nodes (step 1): a : {[(0,32),0]}; 0xff00 : 1:{[(8,16),0]}, 0:{[(0,8),0],[(16,32),0]}; b : {[(0,32),0]}; 0xff : 1:{[(0,8),0]}, 0:{[(8,32),0]}. Inner nodes (step 1): the & node of the left operand (e2) : a:{[(8,16),0]}, 0:{[(0,8),0],[(16,32),0]}; the & node of the right operand : b:{[(0,8),0]}, 0:{[(8,32),0]}; the << node (e3) : b:{[(0,8),4]}, 0:{[(0,4),0],[(12,32),0]}. Root | node (e1), step 1: a:{[(8,16),0]}, b:{[(0,8),4]}, 0:{[(0,4),0],[(16,32),0]}; step 2: a:{[(8,12,16),0]}, b:{[(0,4,8),4]}, 0:{[(0,4),0],[(16,32),0]}; step 3 (for lhs a): a:{[(0,4,8,12,16,32),0]}.]
Fig. 6. Identifying Relevant Bit Sections for a Complex Assignment
Example. The example in Fig. 6 illustrates how the bit sections of the lhs variable a of Fig. 1 are determined. The nodes are processed in a bottom-up manner. At the leaf nodes, variables are initialized with whole bit sections, while constants are partitioned such that sequences of 0s or 1s are identified by bit sections. The annotations at the &-nodes indicate that some bits are 0 while others are derived from bit sections of variables a and b. Now let us apply the genexp() algorithm to the example. For bit sections a1..4 and a17..32 we find the contributing bit sections 0 : [(0, 4), 0] and 0 : [(16, 32), 0], which annotate the root node e1, resulting in an assignment of constant 0. For both a5..8 and a13..16 we find a single contributing bit section that annotates e1. From b : [(0, 4), 4] we obtain that a5..8 is assigned b1..4, and from a : [(12, 16), 0] we get that a13..16 is assigned to itself, that is, it remains unchanged. Finally, for bit section a9..12 we detect that there are two contributing bit sections, a : [(8, 12), 0] and b : [(4, 8), 4]. Therefore the operator | at node e1 is part of the expression and we must traverse the descendant nodes to locate the operands. In this case we find the operands a9..12 and b5..8 at the left and right nodes e2 and e3, respectively.
4 Global Phase: Placement of Split and Combine Nodes
In this phase global analysis is performed to relate the bit sections introduced in the local phase to each other by inserting split and combine nodes. Note that the analysis for the insertion of splits and combines for one variable is independent of other variables.

Backward and forward propagation of bit sections. In order to determine whether an existing bit section should be eliminated by a combine node at a given program point, we must know whether the bit section is used later on in the program. This is accomplished by computing anticipable bit section references in a backward analysis. Similarly, to determine whether a bit section should be created by a split node at a program point, we must know whether the bit section already exists. This is accomplished by computing available bit sections in a forward analysis. The values of the data flow variables involved in this analysis are sets of bit sections belonging to a program variable. As before, each bit section is represented by a range (l, h) which denotes bits l + 1 through h. We already defined and used the operations ∩, ∪, and − for value ranges. Since the data flow values are sets of value ranges, we define analogous operations over sets of value ranges, whereby the new set is computed by considering every pair; we denote these operations by ⊓, ⊔, and −−. Given the above operations, the computation of anticipable (B) and available (F) bit sections for node n and variable v is shown below. The B and F sets are computed at the beginning (in) and end (out) of each node. Ref[n, v] is a local set that represents the bit sections of v that are referenced on the lhs or rhs at
node n.

Anticipable Bit Sections: Backward Analysis
  Initialize: Bout[exit, v] = φ
  Propagate:  Bout[n, v] = ⊔ { Bin[s, v] : s ∈ succ(n) }
              Bin[n, v] = (Bout[n, v] −− Ref[n, v]) ⊔ Ref[n, v]

Available Bit Sections: Forward Analysis
  Initialize: Fin[entry, v] = φ
  Propagate:  Fin[n, v] = ⊔ { Fout[p, v] : p ∈ pred(n) }
              Fout[n, v] = (Fin[n, v] −− Ref[n, v]) ⊔ Ref[n, v]

Combine and Split Node Insertion. The insertion points are determined based on the results of the forward and backward analyses. This is done in three steps. First, we identify all candidate sections at each program point that may be either eliminated or created at that point. We refer to these sections as combine candidates and split candidates, respectively. The combine candidate set CCin/out(n, v) and the split candidate set SCin/out(n, v) for a program point n and variable v are identified as follows. If a bit section of v is available at n's entry (exit) but not anticipable at n's entry (exit), then the combine node needed to eliminate the bit section is a legal candidate for insertion at the entry (exit) of n. Similarly, if a bit section of v is anticipable at n's entry (exit) but not available at n's entry (exit), then the split node needed to create the bit section is a legal candidate for insertion at the entry (exit) of n. In the equations below, F*in/out and B*in/out denote the solutions of the equation systems. Note that the only sections of interest are those that are smaller than (0, 32), and therefore (0, 32) is never included in the CC and SC sets. This is because the creation of (0, 32) does not require a split, as there is no larger section from which (0, 32) could be split, and the elimination of (0, 32) does not require a combine, as there is no larger section into which (0, 32) could be merged.

Combine and Split Candidates
  CCin/out(n, v) = {(l, h) : (l, h) ∈ F*in/out(n, v) and (l, h) ∉ B*in/out(n, v)} −− {(0, 32)}
  SCin/out(n, v) = {(l, h) : (l, h) ∈ B*in/out(n, v) and (l, h) ∉ F*in/out(n, v)} −− {(0, 32)}
In the second step we identify the earliest points for combines and latest points for splits for insertion of combine and split nodes. This can be done easily 3 It should the semantics of the operators −− and , (X −− be noted that due to Ref )
Ref is NOT equal to X
Ref .
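As an illustration of the first step (our sketch, not the paper's code), the candidate sets at one program point can be computed from the fixed-point solutions F* and B*; the trailing −− {(0, 32)} is read here as excluding the full-word section itself, matching the intent stated above:

import java.util.HashSet;
import java.util.Set;

class Candidates {
    record Range(int l, int h) {}   // as in the range operations of Section 3

    static Set<Range> combineCandidates(Set<Range> fStar, Set<Range> bStar) {
        Set<Range> cc = new HashSet<>(fStar);   // available ...
        cc.removeAll(bStar);                    // ... but not anticipable
        cc.remove(new Range(0, 32));            // (0, 32) never needs a combine
        return cc;
    }

    static Set<Range> splitCandidates(Set<Range> fStar, Set<Range> bStar) {
        Set<Range> sc = new HashSet<>(bStar);   // anticipable ...
        sc.removeAll(fStar);                    // ... but not available
        sc.remove(new Range(0, 32));            // (0, 32) never needs a split
        return sc;
    }
}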
In the second step we identify the earliest insertion points for combine nodes and the latest insertion points for split nodes. This can be done easily from the results of the first step by comparing the CC and SC sets of predecessor and successor nodes. For example, if a section is present in the CCin set of a node, but not in any of the CCout sets of its predecessor nodes, then the entry point of the node is the earliest point at which the combine can be placed.

Combine and Split Insertion Points
  EarliestCCin(n, v)  = CCin(n, v)  −− ⊔ { CCout(p, v) : p ∈ pred(n) }
  EarliestCCout(n, v) = CCout(n, v) −− CCin(n, v)
  LatestSCin(n, v)    = SCin(n, v)  −− SCout(n, v)
  LatestSCout(n, v)   = SCout(n, v) −− ⊔ { SCin(s, v) : s ∈ succ(n) }
Finally, in the third step we insert the combine and split nodes. To this end, during the backward and forward analyses, as well as during the computation of combine and split candidates, the bit sections in the data flow sets are distinguished into two categories: those which are added to the sets because references to them are encountered, and those which are added as a consequence of adding the ones which are referenced. Only the bit sections that are marked as being directly referenced are considered during the insertion of split and combine nodes. The insertion conditions are given below; the combine nodes are inserted before the split nodes. The portions of the combine and split nodes which are not specified and are marked by dots are determined by examining the remaining bit sections that exist at that program point.

Combine and Split Insertion
  if (l, h) ∈ EarliestCCin/out(n, v) then insert combine node 'v.... := ⊕(...., vl+1..h, ....)'
  if (l, h) ∈ LatestSCin/out(n, v) then insert split node '(...., vl+1..h, ....) := v....'
5 Concluding Remarks
We presented a novel program representation which supports analyses at the bit section level. Instead of coping with subword accesses separately in each optimization, our representation makes those accesses explicit, making it easy to realize traditional analyses at the subword level. In a local analysis phase we analyze statement by statement and identify bit sections which could be of interest for subsequent optimization phases. Then we relate those bit sections to each other by introducing split and combine nodes.
References

1. T. M. Conte, P. K. Dubey, M. D. Jennings, R. B. Lee, A. Peleg, S. Rathnam, M. Schlansker, P. Song, and A. Wolfe, “Challenges of Combining General-Purpose and Multimedia Processors,” IEEE Computer, Vol. 30, No. 12, pages 33–37, Dec. 1997.
2. J. Davidson and S. Jinturkar, “Memory Access Coalescing: A Technique for Eliminating Redundant Memory Accesses,” ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 186–195, 1994.
3. J. Fridman, “Data Alignment for Sub-Word Parallelism in DSP,” IEEE Workshop on Signal Processing Systems (SiPS), pages 251–260, 1999.
4. S. Larsen and S. Amarasinghe, “Exploiting Superword Level Parallelism with Multimedia Instruction Sets,” ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 145–156, Vancouver, B.C., Canada, June 2000.
5. X. Nie, L. Gazsi, F. Engel, and G. Fettweis, “A New Network Processor Architecture for High Speed Communications,” IEEE Workshop on Signal Processing Systems (SiPS), pages 548–557, 1999.
6. A. Peleg and U. Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro, 16(4):42–50, August 1996.
7. M. Stephenson, J. Babb, and S. Amarasinghe, “Bitwidth Analysis with Application to Silicon Compilation,” ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 108–120, Vancouver, B.C., Canada, June 2000.
8. J. Wagner and R. Leupers, “C Compiler Design for an Industrial Network Processor,” ACM SIGPLAN Workshop on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 155–164, June 2001.
9. Y. Zhang and R. Gupta, “Data Compression Transformations for Dynamically Allocated Data Structures,” International Conference on Compiler Construction (CC), Grenoble, France, April 2002.
Online Subpath Profiling

David Oren, Yossi Matias, and Mooly Sagiv

School of Computer Science, Tel-Aviv University, Israel
{doren,matias,msagiv}@post.tau.ac.il
Abstract. We present an efficient online subpath profiling algorithm, OSP, that reports hot subpaths executed by a program in a given run. The hot subpaths can start at arbitrary basic block boundaries, and their identification is important for code optimization; e.g., to locate program traces in which optimizations could be most fruitful, and to help programmers in identifying performance bottlenecks. The OSP algorithm is online in the sense that it reports at any point during execution the hot subpaths as observed so far. It has very low memory and runtime overheads, and exhibits high accuracy in reports for benchmarks such as JLex and FFT. These features make the OSP algorithm potentially attractive for use in just-in-time (JIT) optimizing compilers, in which profiling performance is crucial and it is useful to locate hot subpaths as early as possible. The OSP algorithm is based on an adaptive sampling technique that makes effective utilization of memory with small overhead. Both memory and runtime overheads can be controlled, and the OSP algorithm can therefore be used for arbitrarily large applications, realizing a tradeoff between report accuracy and performance. We have implemented a Java prototype of the OSP algorithm for Java programs. The implementation was tested on programs from the Java Grande benchmark suite and exhibited a low average runtime overhead.
1
Introduction
A central challenge facing computer architects, compiler writers and programmers is to understand a program’s dynamic behavior. In this paper we develop the first profiling algorithm with the following properties: (i) it is online, and thus well suited for JIT-like compilation and dynamic optimizations, where decisions have to be made early in order to control the rising cost of missed opportunity that results from prediction delay [7]; and (ii) profiling information is recorded for subpaths that start at arbitrary program points. Related works are described in Section 4.
Research supported in part by an Alon Fellowship and by the Israel Science Foundation founded by The Academy of Sciences and Humanities.
Research supported in part by the Israel Science Foundation founded by The Academy of Sciences and Humanities.
1.1 Hot Subpaths

Considering arbitrary subpaths presents a considerable performance challenge. As the number of subpaths under consideration could be in the hundreds of millions, maintaining a full histogram of all subpaths is prohibitively expensive in both runtime and memory overheads. Figure 1 presents a situation where several cold paths include a common section of code [3456]. This common section is hot, even though the paths that contain it are cold.
Fig. 1. Several cold paths sharing a common hot subpath, [3456]. This code segment may be part of a loop, or may be called numerous times from other functions
1.2 Main Results
In this paper, we present a new algorithm for Online Subpath Profiling, OSP, that records hot subpaths which start at arbitrary basic block boundaries. The OSP algorithm can report an estimate of the k hottest subpaths in a given program on a given run. This can be used by a programmer, an optimizing compiler or a JIT compiler to locate “hot” areas where optimizations pay off. Whereas other profiling algorithms are typically limited to certain path types, the OSP algorithm identifies arbitrary hot subpaths in the program. The OSP algorithm is online in the sense that it reports at any point during program execution the hot subpaths as observed so far. It has very low memory and runtime overheads, and exhibits high accuracy in reports. For example, consider the JLex [5] program for generating finite state machines from regular expressions. The OSP algorithm accurately identifies the 5 hottest subpaths when profiling this program on the provided sample input. The memory overhead
is 45 kilobytes, compared to 170 kilobytes used by JLex. The runtime overhead is 64%, and could be as low as 17% with an appropriate implementation of the profiler. The online nature of the OSP algorithm is demonstrated for the FFT program. At every point during its execution, the hottest subpaths observed so far are reported with high accuracy. This feature makes the OSP algorithm very attractive for use in JIT-like compilers, in which profiling performance is crucial and it is essential to locate hot subpaths as early as possible. The JLex program generates approximately 22 million subpaths of length up to 1024 basic blocks. From this input a sample of about 2000 subpaths is sufficient to correctly identify the 5 hottest subpaths. Results for FFT are even more favorable, as elaborated in Section 3. The OSP algorithm is based on an adaptive sampling technique presented by Gibbons and Matias [8] that makes effective utilization of memory with small overhead. Both memory and runtime overheads can be controlled, and the OSP algorithm can therefore be used for arbitrarily large applications, realizing a tradeoff between accuracy and performance. The accuracy depends on the skew level of the distribution of the subpaths. The higher the skew the better the performance, which is an attractive feature as the importance of the profiler is greater for skewed distributions.
1.3 Prototype Implementation
We have implemented a simple prototype of the OSP algorithm in Java for Java programs, using the Soot [17] framework for program instrumentation. The architecture of the implementation is described in Figure 2. The OSP algorithm is called by a profiling agent sitting on top of the JVM. It may accept input parameters such as available memory and a limit on runtime overhead; it continuously reports hot subpaths that can be fed back into the JVM for optimization. We tested the algorithm on 4 programs from the Java Grande benchmark suite [9], on JLex [5] and on Sun's javac Java compiler [15]. We measured the runtime overhead, the memory overhead and the accuracy of the results. The runtime overhead averages less than 20%, and the memory overhead ranges from 40 to 65 kilobytes, compared to 100 to 170 kilobytes used by the programs. The OSP algorithm identifies most of the hottest subpaths in
[Figure 2: a profiling agent sits on top of the JVM; tuning knobs and the program input feed the OSP algorithm through the agent, and the continuously reported hot subpaths can be fed back into the JVM, which produces the program output.]
Fig. 2. The OSP Architecture
each of the tested programs. This shows that even for low memory and runtime overhead we can obtain very accurate reports of the program behavior.
1.4 Outline of the Rest of this Paper
Section 2 describes the online subpath profiling algorithm. Section 3 describes a simple prototype implementation and experimental results. Related works are discussed in Section 4. Conclusions and further work are discussed in Section 5.
2 The Online Subpath Profiling Algorithm
The OSP algorithm avoids the full cost of counting all subpaths by: (i) sampling a fraction of the executed subpaths, (ii) maintaining the sample in a concise manner, obtaining a sample that is considerably larger than available memory, and (iii) identifying hot subpaths and deriving a highly accurate estimate of their count from subpath frequencies in the sample.
2.1 The Algorithm
The OSP algorithm is based on the hot-list algorithm presented in [8]. Given a sequence of items, the hot-list algorithm maintains a uniform random sample of the sequence items in a concise manner, namely as pairs of (id, count). The sampling probability depends on the actual skewness of the data, and is adapted dynamically during execution. We extend the hot-list algorithm to subpaths, and maintain a concise sample of subpaths. At every sample point the OSP algorithm determines the length of the subpath to be sampled according to a predetermined distribution. The sampled subpath is encoded into a subpath id, and is either inserted into the resulting histogram (if it was not there already), or the subpath's count is incremented. If necessary, the sampling probability is adapted, and the elements in the sampled set are resampled. Using concise samples ensures efficient utilization of memory. Instead of maintaining a multiset of ids, each id has a corresponding counter, and thus a frequently occurring element will not require a large memory footprint. With an allowed memory footprint m and an average count G, the effective sample size is m × G. Thus, G can be defined as the gain obtained from using concise samples. The exact gain depends on the distribution of the elements in the input set. The OSP algorithm's pseudo-code is given in Figure 3. The method enterBlock is triggered for each basic block and determines whether or not sampleBlock needs to be invoked. The sampleBlock method (the core of the algorithm) is executed for a very small fraction of the basic blocks, namely those which are part of a subpath selected to be in the sample. The algorithm maintains two variables: skip, which holds the number of basic blocks that will be skipped before the next sampling begins; and length, which holds the length of the subpath we wish to sample.
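As a concrete illustration of concise samples (ours, not code from the paper), the histogram can be kept as a map from subpath id to count:

import java.util.HashMap;
import java.util.Map;

class ConciseSample {
    private final Map<Long, Integer> counts = new HashMap<>();

    void add(long subpathId) {
        counts.merge(subpathId, 1, Integer::sum);   // new id with count 1, or increment
    }

    int sampledOccurrences() {                      // total occurrences, i.e. m × G
        return counts.values().stream().mapToInt(Integer::intValue).sum();
    }
}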
At the beginning of each basic block the enterBlock method is called. If a path is currently sampled, this method calls sampleBlock. Otherwise, if the next block is to be sampled (skip is 0), the length of the next sampled subpath is selected at random from a predetermined probability distribution. The sampleBlock method appends the current basic block to the subpath which is currently sampled, using an implementation specific encoding. When this subpath is of the required length, the sampled set is updated by calling the updateHotList method. The sampling probability determines the selection of skip in the chooseSkipValue method. The updateHotList method is responsible for maintaining the hot-list.
void enterBlock(BasicBlock b) {
    if (sampling)
        sampleBlock(b);
    else {
        if (--skip == 0) {
            length = choosePathLength();
            subpath = new SubPath();
            sampling = true;
        }
    }
}

void sampleBlock(BasicBlock b) {
    subpath.appendBlock(b);
    if (--length == 0) {
        updateHotList(subpath.id);
        skip = chooseSkipValue();
        sampling = false;
    }
}

[Timeline: execution alternates between phases of (skip) skipped blocks and (length) sampled blocks.]

Fig. 3. The basic OSP algorithm

Note that the probability selections of skip, length and the resampling parameters are chosen so that at any given point the maintained histogram consists of a random sample representing the subpaths observed so far. The sampling can be uniform, or it can be appropriately biased; e.g., the probability of a subpath being sampled can be a function of its length.

Let us consider an example of the algorithm in action on the fragment of the control flow graph shown in Figure 1. At program startup, the OSP algorithm decides how many basic blocks should be skipped before sampling begins (using the chooseSkipValue function), and assigns this value to the skip variable. Let this value be 2. The algorithm is then called at the beginning of basic blocks 1 and 2a, each time decreasing the value of skip by one. When skip becomes 0, at the beginning of block 2a, the algorithm decides how long a path should be sampled (using the choosePathLength function), and goes into sampling mode. Let us assume the algorithm has decided to sample a path of length 4. The next four times it is called (blocks 3, 4, 5 and 6), the algorithm will append the identifier of the current basic block to the identifier of the path being generated. Once the identifier for path [3456] has been generated,
the algorithm will update the sampled set with this new subpath id. Finally, the algorithm will decide how many blocks are to be skipped before sampling begins again, and will switch into skipping mode. Every time subpath [3456] is sampled, its count in the sample is incremented. Note that it will be sampled at a rate about 3 times the rate of subpath [4567b], about 6 times the rate of subpath [4567a], and over 20 times the rate of subpaths [12b34] and [2b345]. Also note that even for a sampling probability of about 1/40, it is expected to be sampled approximately 150 times, enabling a very accurate estimate of its count.
2.2 Complexity Analysis
The skipping overhead, in the enterBlock method, is O(1) operations per block, with a constant depending on the exact implementation of the skipping process. The sampling overhead, in the sampleBlock method, is O(1) operations per sampled block. The cost of table resampling is controlled by setting the new sampling probability, and can be made to be amortized O(1) per sampled block [8]. Since the number of sampled blocks is a small fraction of the total number of executed blocks, the total sampling overhead is o(n), where n is the number of executed blocks, and is o(1) amortized per executed block. A more detailed analysis is given in [12,13].
2.3 Special Considerations
Sampling and Skipping The sampling and counting are performed using a hot-list algorithm [8]. The hot-list algorithm is given an estimate of the input size and a permissible memory footprint. From these values an initial sampling frequency f is computed, and each subpath is sampled with probability 1/f. Let m be the permissible memory footprint, G the expected gain and n the expected input size; then

  f = n / (m × G)    (1)

Instead of deciding for each subpath whether it should be sampled or not, a skip value is computed [18]. This value represents how many subpaths must be skipped before one is sampled. The skip values are chosen so that their expected value is f, and for large values of f the performance gain can be important.
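The following sketch is ours and makes assumptions the paper leaves open: skips are drawn from a geometric distribution with mean f (the paper's exact scheme is in [18]), and path lengths follow the 1/2^n distribution described under Path Length Bias below, capped at 1024 as in the JLex experiments:

import java.util.Random;

class Choices {
    private final Random rnd = new Random();

    // One way to realize "expected value f": the number of trials until
    // the first success of a Bernoulli(1/f) coin.
    int chooseSkipValue(double f) {
        double p = 1.0 / f;
        int k = (int) Math.ceil(Math.log(1.0 - rnd.nextDouble()) / Math.log(1.0 - p));
        return Math.max(1, k);
    }

    // Length 2^n with probability 1/2^n (n >= 1); the 1024 cap is our assumption.
    int choosePathLength() {
        int len = 2;
        while (len < 1024 && rnd.nextBoolean())
            len *= 2;                 // each doubling halves the probability
        return len;
    }
}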
Subpaths For performance reasons, we observe that it is advantageous to only consider subpaths whose length is a power of two. Since the number of subpaths increases (quadratically) with the number of basic blocks, and the number of subpaths in the input affects accuracy for a given sample size, we improve performance by limiting the input set. Our choice provides a significant reduction in the noise that exists in the sample set. Moreover, for any hot subpath of length k, we can find a subpath of length at least k/2 which is part of the sample space.
Path Length Bias Once we have decided a subpath should be sampled, we have to decide how long a subpath to sample. It has been suggested that shorter hot subpaths yield better possibilities for optimization (see [10] and its definition of minimal hot subpaths). Thus, in the current implementation we have decided to prefer shorter paths. Paths are sampled with a geometric probability distribution, with a path of length 2^n, n ≥ 1, being sampled with probability 1/2^n. Preferring shorter subpaths also increases the probability of finding minimal subpaths. In the case of loops, for instance, sampling longer subpaths will often yield the concatenation of several iterations of the loop.
An important feature of the algorithm is that it can accommodate other biases towards path lengths. Path length could be selected by any probability distribution; e.g., geometric (as above), uniform, or one which provides a bias towards longer paths. The random selection of length is performed by the method choosePathLength, and the algorithm works correctly for any selected distribution.

Concise Samples and Resampling The hot-list algorithm maintains a list of concise samples of the sampled subpaths. This list can be thought of as a histogram: for each sampled subpath we hold an identifier, and a count representing how many times it has been sampled so far. Since each sampled subpath uses the same amount of memory even if it is sampled numerous times, the use of concise samples increases the effective sample size. The benefit resulting from the use of concise samples depends on the program being profiled. Profiling a program having a small number of very hot subpaths will benefit greatly from the use of concise samples. At the other extreme, profiling a program where the subpaths are evenly distributed will not benefit from them.
If at some point during execution the sample exceeds its allocated memory footprint, f is increased, all elements in the sample are resampled with probability f′/f (where f′ is the previous sampling frequency), and all new elements are sampled with the new probability. This ensures that the algorithm uses a limited amount of memory, which can be determined before the program starts.

Encoding Each basic block can be assigned a unique integer identifier. We now need a function f that, given a path P = b1 b2 · · · bn where the bi are basic blocks, will generate a unique identifier for the path. Ideally, we could find a function f that is sensitive to permutation, but not to rotation. Formally, given two paths P1 = b1 b2 · · · bn and P2, then f(P1) = f(P2) iff there is some j such that P2 = bj · · · bn b1 · · · bj−1.
Reporting the Results At any point during program execution the subpaths in the sample can be reported. It is important to remember that not all subpaths in the sample have the same accuracy. Intuitively, the higher the count of the subpath in the sampled set, the higher the accuracy of the count, and the higher the probability that this subpath is hot. We can either report only subpaths whose count in the sampled set exceeds some threshold, or report the k hottest subpaths in the sampled set. For each reported subpath, an estimate of its accuracy is given [8].
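A minimal reporting sketch of ours: selecting the k entries with the highest counts from the concise sample, hottest first:

import java.util.Comparator;
import java.util.List;
import java.util.Map;

class Report {
    // The k hottest subpaths in the sampled set, highest count first.
    static List<Map.Entry<Long, Integer>> topK(Map<Long, Integer> hotList, int k) {
        return hotList.entrySet().stream()
                .sorted(Map.Entry.<Long, Integer>comparingByValue(Comparator.reverseOrder()))
                .limit(k)
                .toList();
    }
}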
2.4 A Framework for Profilers
The description of the algorithm given here is very general. The behavior of the algorithm can be modified extensively by changing certain elements. Hence, the algorithm can serve as a framework for profiling under various preferences or constraints. It is very important to remember that many of the decisions presented here (limiting ourselves to paths of length 2^n, or giving a higher sampling probability to shorter paths, for instance) are implementation details, and do not stem from any limitation in the algorithm itself. It would be very easy to collect information on paths of arbitrary length, or on any different subset of paths, for instance paths of length 1.5^n. Another possibility is to modify the counting method to more accurately identify changes in the working set of the profiled program. This could be done using a sliding window that would take into account just the latest encountered subpaths, or with an aging function that would give more weight to more recent subpaths.
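For instance, one way (our sketch, not the paper's) to realize such an aging function is to periodically halve all counts, so that older observations decay geometrically:

import java.util.Map;

class Aging {
    // Halve every count; entries that decay to zero are dropped.
    static void age(Map<Long, Integer> hotList) {
        hotList.replaceAll((id, c) -> c / 2);
        hotList.values().removeIf(c -> c == 0);
    }
}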
3 Prototype Implementation
We have implemented a prototype in Java, using the Soot framework [17]. In the prototype implementation, profiling a program consists of two steps: first, the program to be profiled is instrumented. The class files are processed, and calls to the subpath profiler are added at the beginning of each basic block. Once the program is instrumented, it can be run and profiled on any given input. Instrumentation could also be performed dynamically, by modifying the Java class loader. Multi-threaded programs are handled by associating a different subpath profiler with each running thread. This guarantees that subpaths from different threads are kept separately, and also reduces synchronization overhead between the different threads. The invocations to the updateHotList method are synchronized. Our initial experience indicates that this does not create synchronization overhead, since this method is rarely invoked. Since we are not notified when a thread ends, we periodically check whether the thread associated with each subpath profiler is still active, and if not, we make the subpath profiler eligible for garbage collection. In the prototype implementation, we did not implement JIT-like optimizations. Instead, when the JVM exits, a report is generated. For each path in the sampled set, its description and count are displayed.
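One plausible realization of the per-thread association (our sketch; SubpathProfiler is a hypothetical name, the paper does not show this mechanism) uses a ThreadLocal, so each thread transparently gets its own profiler instance without per-block locking:

class SubpathProfiler {
    void enterBlock(int blockId) { /* skip/sample logic of Fig. 3 */ }
}

class PerThread {
    static final ThreadLocal<SubpathProfiler> PROFILER =
            ThreadLocal.withInitial(SubpathProfiler::new);

    // The instrumented code calls this at the beginning of each basic block.
    static void enterBlock(int blockId) {
        PROFILER.get().enterBlock(blockId);
    }
}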
In the current implementation the enterBlock method is part of the Java code. Hence it becomes the dominant factor in the total runtime overhead. A preferred implementation would be to have this method run in the JVM itself, in which case the sampling overhead is expected to become dominant. Therefore, in the measurements we have considered these two overheads separately.

Path Representation For the reference implementation, we did not focus on path representation, and only implemented a simple path representation scheme. A path description is kept as a list of strings, each string describing a basic block. The lists are generated dynamically and entail some overhead, especially for long paths. It is important to remember that these descriptions are not strictly necessary. If the OSP algorithm is used in a JIT compiler, no output is necessary, and the descriptions of the hot subpaths are of no interest; each subpath can be identified with a unique integer id. However, even if these descriptions are required, they are not needed during program execution, but only when the report is displayed. Therefore, if memory becomes an issue, a possible solution would be to keep the path descriptions not in memory, but in secondary storage. Each path description would have to be written to disk only once, thus keeping the time overhead at acceptable levels.
More complete solutions would involve developing a memory efficient representation of the paths: for instance, a naive subpath description could contain a description of the block where it begins, and for each subsequent branch a bit signifying whether this branch was taken or not. A path of length n would thus require c + (n − 1) bits for its description, where c is the number of bits required to store the identifier of the starting basic block. Since the Java bytecode contains multiple branch instructions (used with the switch construct), the actual encoding would have to be more complex.
A different solution altogether would be to represent the subpaths using tries. With tries it would be possible to efficiently check whether a subpath is already part of the sampled set, increase the count of an existing subpath, and add a new subpath. Using tries would require a way to convert paths to a canonical form, to make sure the trie is not sensitive to rotation. More details can be found in [12,13].

Encoding The encoding of subpaths determines how subpaths are grouped together for purposes of counting and reporting. The current implementation uses an encoding consisting of the subpath length, and of running the exclusive-or operator over block identifiers. This encoding is simple, efficient, and groups together different permutations of the same path. The exclusive-or encoding has a significant drawback: it disregards blocks that occur an even number of times. In order to evaluate the quality of the results, we have run the profiler with a different encoding as well. These tests
have shown that the results obtained by the exclusive-or encoding are correct, in spite of its drawback. The implications of this encoding and other possible encodings are presented in [12,13].
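For illustration, a sketch of ours of the described length-plus-XOR scheme:

class Encoding {
    // Pack the subpath length with the XOR of its basic block ids.
    static long encode(int[] blockIds) {
        int x = 0;
        for (int id : blockIds)
            x ^= id;                  // order-insensitive, hence rotation-insensitive
        return ((long) blockIds.length << 32) | (x & 0xffffffffL);
    }
}

Because XOR is commutative, all permutations (and therefore all rotations) of the same blocks map to the same id; blocks occurring an even number of times cancel out, which is exactly the drawback noted above.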
3.1 Results
We have run the profiler on four programs from the Java Grande benchmark suite [9], on the JLex utility [5] and on the javac Java compiler [15]. All programs were run on a computer with a 1.2GHz Athlon processor and 512MB of memory, running Sun's JDK 1.3.1 on Windows 2000. Table 1 shows the sizes of those programs. It is important to remember that from the profiler's view, what matters is not the number of lines of code in the program, but the program's dynamic size (its trace length).
Table 1. For each program we show the number of basic blocks encountered during execution, the number of subpaths of length 2^n (n ≤ 5), and the number of distinct subpaths. For JLex there are two separate entries, one showing the number of subpaths of length up to 32, the other the number of subpaths of length up to 1024

Program       Basic blocks    Subpaths        Distinct subpaths
JLex (1024)   2,212,208       22,120,044      828,772
JLex (32)     2,212,208       11,060,983      37,985
FFT           169,867,487     849,337,378     870
HeapSort      124,039,672     620,198,303     1,095
MolDyn        1,025,640,629   5,128,203,088   6,316
RayTrace      1,367,934,068   6,839,670,283   6,800
javac         9,838,697       49,191,773      462,813
The table also displays the number of subpaths encountered during program execution, as well as the number of distinct subpaths encountered. The subpaths are those of length 2^n, where n ≤ 5. For JLex, it was also possible to obtain accurate results for paths of length up to 1024. This was not done for the other programs, since extremely long runtimes would have been needed. These results show the size of the input data set over which the OSP algorithm works. It is also interesting to note that, even for a very limited subpath length, obtaining accurate results required an extremely large amount of time: more than an hour for FFT and HeapSort, and almost ten hours for MolDyn and RayTrace.

Runtime Overhead Table 2 shows the runtime overhead of the profiler. The total runtime overhead ranges from 31% to 286%. The sampling overhead (the overhead generated by the sampleBlock method) is much smaller, ranging from 3% to 56%.
Most of the runtime overhead is created by the skipping process. If the profiler is incorporated into the JVM — for instance, in order to use it for JIT compiling — the skipping process will have much lower overhead. In such a case, the total runtime overhead will be similar to the sampling overhead presented here. Further understanding of the overhead created by the profiler can be gained by examining the first section of the Java Grande benchmark suite. These benchmarks check raw performance of the JVM, by measuring how many operations of various kinds are performed per second. For instance, a loop containing additions of ints will see a tenfold slowdown. On the other hand, a loop containing divisions of longs will slow down only by a factor of 1.18. Creating an array of 128 longs will have an even smaller slowdown factor of 1.04.
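To make the skip/sample split concrete, here is a minimal sketch; the method names enterBlock and sampleBlock follow the paper, but the skip-count mechanics and the geometric-gap trick are our assumptions, not the reference implementation:

    import java.util.Random;

    // Sketch of the instrumentation entry point: on most calls only a skip
    // counter is decremented (the "skipping process"); the expensive
    // sampling path runs only when the counter reaches zero.
    final class OspProfiler {
        private static final Random RNG = new Random();
        private static int samplingFrequency = 1000; // illustrative value of f
        private static int skip = nextSkip();

        static void enterBlock(int blockId) {
            if (--skip > 0) {
                return;               // cheap common case
            }
            skip = nextSkip();
            sampleBlock(blockId);     // begin recording a subpath here
        }

        // Geometrically distributed gap, so that each block starts a sample
        // with probability roughly 1/f without a random draw per block.
        private static int nextSkip() {
            double u = RNG.nextDouble();
            double p = 1.0 / samplingFrequency;
            return Math.max(1, (int) Math.ceil(Math.log(1.0 - u) / Math.log(1.0 - p)));
        }

        private static void sampleBlock(int blockId) {
            // ... record a subpath of a chosen length starting at blockId ...
        }
    }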
Table 2. The running time in seconds of the original and the instrumented programs, and the time the algorithm spent in sampling mode. The two last columns display the total runtime overhead, and the overhead generated by the sampling process itself, without taking into account the cost of deciding when to sample a path

    Program   Original  Instrumented  Only-sampling  Total overhead  Sampling overhead
    JLex      0.390     0.640         0.070          64.10%          17.95%
    FFT       21.080    27.649        2.123          31.16%          10.07%
    HeapSort  1.982     6.238         1.111          214.73%         56.05%
    MolDyn    10.064    33.878        1.101          236.63%         10.94%
    RayTrace  11.997    46.356        0.450          286.40%         3.75%
    javac     1.31      3.63          0.230          177.10%         17.55%
Sampling and Efficiency Tradeoff Table 3 displays the number of sampled subpaths as recorded by our implementation of the OSP algorithm. The second and third columns are the number of sampled subpaths with and without repetitions. The Gain column displays the average count of a subpath in the sampled set, i.e., the gain obtained by using concise samples. The f column shows the sampling frequency, as defined in Equation 1. We impose a minimum limit on f , since low values of f generate high overhead and do not contribute to the accuracy of the results being obtained. This was important for the FFT program, where the gain is very high. In the original FFT run, for instance, the sampling probability was one in 40. The results were similar, but the total runtime overhead was 145% (compared to 31% in the final run), and the sampling overhead was 102% (compared to 10%). As has already been mentioned, the OSP overhead does not depend only on the sampling probability. The HeapSort program performs very simple operations on integers (comparisons and assignments). Since the cost of sampling, relative to these simple operations, is high, the sampling overhead is higher for this program than for others.
Table 3. The number of subpaths in the sample with and without repetition, the gain obtained by using concise samples (the ratio between columns two and three), and the sampling frequency f at the end of the program

    Program   # subpaths  # distinct subpaths  Gain    f
    JLex      2,183       891                  2.45    1,000
    FFT       168,885     314                  537.85  1,000
    HeapSort  10,217      475                  21.50   12,304
    MolDyn    2,530       353                  7.17    400,000
    RayTrace  5,276       443                  11.90   260,000
    javac     281         263                  1.07    32,000
Memory Overhead Table 4 shows the memory overhead of the profiler. The programs' memory footprint (for both the instrumented and the uninstrumented versions) was measured at the end of the execution. The programs' memory footprint varies between 100 and 200 kilobytes, and the profiler's is about 50 kilobytes. For simplicity, we used a straightforward representation of sampled subpaths. Thus, the actual memory required during a profiling run may be higher. With a different implementation this can be avoided, as suggested earlier in this section.

Table 4. Memory usage of the different programs. The instrumented memory does not take into account the memory needed for maintaining the output of the algorithm

    Program   Program footprint  Instrumented footprint  Overhead
    JLex      169,728            213,032                 43,304
    FFT       107,416            147,168                 39,742
    HeapSort  107,400            156,360                 48,960
    MolDyn    111,800            152,664                 40,864
    RayTrace  108,016            173,816                 65,800
Accuracy of Results Table 5 compares the results obtained by the OSP implementation with results obtained by a profiler that collects information about all subpaths (with no sampling). For brevity, we only show the results for FFT. Similar results were obtained for JLex. For each subpath, an estimated count was computed from its count in the sample, the sampling frequency, and the a priori probability of sampling a path of that length. The table shows, for each of the ten hottest subpaths in the sample, its rank in the accurate results. We can see that the estimated count is very close to the accurate one. For example, the count of the hottest subpath was estimated with a precision of 0.94%, and that of the second hottest with a precision of 0.11%.
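As a concrete check of this arithmetic (our reading of Tables 3 and 5, with f = 1,000 for FFT): the hottest subpath has length 4 and a sample count of 27,006, and 27,006 × 1,000 × 4 = 108,024,000, exactly the estimated count shown in Table 5; likewise the second hottest, of length 16, gives 6,479 × 1,000 × 16 = 103,664,000. The length factor reflects an a priori sampling probability that is inversely proportional to the subpath length.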
Table 5. For the hottest paths in the sample we show their true rank as obtained by counting all subpaths, their count in the sample and in the full results, their estimated count and the error in the estimation. For each path we also show its length. The table is sorted by estimated count

    Sample rank  Exact rank  Sample count  Est. count   Exact count  Error   Length
    1            1           27,006        108,024,000  109,051,898  0.94%   4
    2            2           6,479         103,664,000  103,782,188  0.11%   16
    3            3           12,841        102,728,000  101,713,904  1.00%   8
    4            4           39,545        79,090,000   79,691,780   0.76%   2
    5            6           2,372         18,976,000   14,679,016   29.27%  8
    6            11          4,322         8,644,000    8,388,604    3.04%   2
    7            12          4,226         8,452,000    8,388,520    0.76%   2
    8            10          4,200         8,400,000    8,388,608    0.14%   2
    9            9           4,155         8,310,000    8,388,608    0.94%   2
    10           8           4,022         8,044,000    8,388,608    4.11%   2
Table 6. Stops after every 10 million blocks. At each stop point, we show the rank in the sample of the 5 highest ranking subpaths in the full count. Note that the 5 highest ranking subpaths are not necessarily the same at each stop point

    True rank  6%  12%  18%  24%  30%  36%
    1          2   6    1    2    2    1
    2          3   4    2    1    1    2
    3          1   2    3    3    3    4
    4          4   1    8    4    4    3
    5          5   5    7    5    5    5
In spite of the profiler's preference for short paths, we can see that the hottest paths were of non-trivial length.

Incremental Results The algorithm can, at any point during program execution, give an estimate of the hottest subpaths encountered so far. In order to test this capability, we stopped the FFT example at several equally spaced points. At each of these points, we took the 5 hottest subpaths in the accurate subpath count, and checked their rank in the report of the sampling profiler. We can see in Table 6 that during program execution the intermediate results obtained by the sampling profiler match the "true" results obtained by a full count of all subpaths with high accuracy. Similar results were obtained for JLex.

Arbitrary Length In order to perform a sanity check on our decision to limit ourselves to paths of length 2^n, we ran a different version of the profiler, which is able to sample paths of arbitrary lengths. The length of the paths sampled varies from 2 to 1024, with the probability of selecting a path of length n being approximately 1/(10n).
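A minimal sketch of how such lengths might be drawn, our own illustration with weights proportional to 1/n (which matches the stated ≈ 1/(10n) shape up to normalization), using inverse-CDF lookup over a precomputed table:

    import java.util.Random;

    // Draw a subpath length in [2, 1024] with probability proportional
    // to 1/n, via binary search over a precomputed cumulative table.
    final class LengthSampler {
        private final double[] cdf = new double[1023]; // lengths 2..1024
        private final Random rng = new Random();

        LengthSampler() {
            double total = 0.0;
            for (int n = 2; n <= 1024; n++) {
                total += 1.0 / n;
                cdf[n - 2] = total;
            }
            for (int i = 0; i < cdf.length; i++) {
                cdf[i] /= total; // normalize so the last entry is 1.0
            }
        }

        int nextLength() {
            double u = rng.nextDouble();
            int lo = 0, hi = cdf.length - 1;
            while (lo < hi) { // first index with cdf[index] >= u
                int mid = (lo + hi) / 2;
                if (cdf[mid] < u) lo = mid + 1; else hi = mid;
            }
            return lo + 2;
        }
    }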
As expected, the results were much noisier, with the hottest subpaths being sampled no more than 3 times. In spite of this, the results are acceptable, with the hottest subpaths corresponding to those obtained when the path lengths were limited to 2^n. Still, with such low counts the results are unlikely to be accurate. Therefore, running the OSP algorithm with arbitrary path lengths would require a larger sampling probability, and a larger memory overhead, to make sure paths are sampled often enough for the results to be meaningful.
4 Related Work
The original Ball-Larus path profiling algorithm recorded the execution frequency of intraprocedural, acyclic paths [4]. The program was instrumented in such a way that each path would generate a unique identifier during program execution.

Ammons, Ball and Larus extended acyclic path profiling [1]. They associated hardware metrics other than execution frequency with paths. They also introduced a runtime data structure to approximate interprocedural paths. In practice [10] these linkages were imprecise, and this method does not connect paths across loop iterations.

Another interprocedural extension of the Ball-Larus path profiling technique is described by Melski and Reps [11]. Paths in this technique do not cross loops. Interprocedural paths are assigned a unique identifier statically.

Larus [10] later described a new approach to path profiling, which captures a complete picture of the program's dynamic behavior. He introduced whole program paths, which are a complete and compact record of a program's entire control flow. A whole program path crosses both loop and procedure boundaries, and so provides a practical basis for interprocedural path profiling. Since the whole program path can be quite large (hundreds of megabytes), it has to be compressed, and compression is achieved by representing the WPP as a grammar. The grammar is over an alphabet of symbols representing acyclic paths, but the algorithm can be adapted to run over symbols representing vertices or edges. Once the WPP for a program has been collected and compacted, it is possible to run different analyses on this representation of program flow. Larus presents one such analysis, which identifies hot subpaths. The WPP approach requires two stages: data collection and analysis. Hence, it cannot be used by a JIT compiler to locate hot subpaths during program execution.

Duesterwald and Bala [7] analyze online profiling and its application to JIT compilation. Online profiling is a different challenge than offline profiling: the longer the program execution is profiled, the later predictions are made and, consequently, the lower the potential benefit of the predictions. They have shown that prediction delay is a significant factor in evaluating the quality of a prediction scheme. Thus, while intuition may call for longer and more elaborate profiling, the opposite is true: less profiling actually leads to more effective
predictions. We believe it would be interesting to combine hot subpath profiling with their results.

Taub, Schechter and Smith present an idea for reducing profiling overhead [16]. This approach produces binaries that periodically record aspects of their executions in great detail. It works because program behavior is predictable, and it suffices to collect information during only part of the program's runtime. After a specified number of executions, the instrumentation can remove itself from the program code, and generate no more overhead.

In [2], Arnold and Ryder proposed to maintain two versions of the program in memory: one instrumented, and one almost uninstrumented. The program execution can then jump between these two versions, collecting enough data for effective profiling, but keeping the overhead low. The technique as presented there differs from the OSP algorithm in several details (back-edges return to the uninstrumented code, independently of the profiler), but their framework could be adapted for use by the OSP algorithm.

Bala, Duesterwald and Banerjia present in [3] a dynamic optimization system called Dynamo. Dynamo is implemented as a native code interpreter that runs on top of the native processor. Once hot traces are located, they are aggressively optimized, and the next occurrences of those traces will run natively. Hot traces may begin only at certain predetermined points, so the results obtained by the OSP algorithm, where no such restriction exists, are more general in nature (as can be seen in Figure 1). It would be interesting to integrate the OSP algorithm into Dynamo, in order to evaluate its benefits and to compare both methods.

A different approach, which uses sampling for profiling by means of a combined software and hardware solution, is described in [14]. Adaptive sampling techniques have been used in related fields, such as value profiling [6].
5 Conclusions
In this paper we demonstrated an efficient technique for online subpath profiling, based on adaptive sampling. The OSP algorithm has been implemented as a prototype, and has been successfully tested on several Java programs. If the profiler is incorporated into the JVM, the skipping process can be incorporated into the JVM as well. As was mentioned, the profiler overhead consists of two parts: the one caused by the skipping process, and the one caused by the sampling process. Once the skipping process is part of the JVM, its overhead could be lowered. For a discussion of possible optimizations when incorporating profiling into a JVM, see [2]. Once the OSP algorithm is fully integrated into a JVM, its output could be used to locate possible candidates for JIT compilation. It is possible to modify the profiler so that it will take the context of subpaths into account. For example, enterBlock can be modified to prefer paths starting at a back edge, or any other paths of interest to the user.
One of the main advantages of the OSP algorithm over other methods is that it can cross loop and procedure boundaries. The Ball-Larus path profiler loses information about the context of a path and its correlation to other paths. For example, consider a loop which contains an if-clause that separates odd from even iterations (a minimal illustration of this shape follows below). The subpath profiler will sample two hot subpaths, one for the behavior occurring in odd iterations, and one for the behavior occurring in even iterations. However, the subpath profiler will do more than that: another hot subpath that will be sampled is the subpath consisting of the concatenation of these two behaviors. An optimizing compiler could use this information to create a specialized unrolled version of the loop that would not contain branching instructions. The algorithm can also be extended to give a priori costs to paths, and to use these costs to affect the probability of sampling paths. For a more in-depth description see [12,13].
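The loop shape just described can be illustrated as follows (our own minimal example; the method names are hypothetical):

    // A loop whose if-clause separates odd from even iterations. The
    // subpath profiler can report a hot subpath for each branch and, in
    // addition, the concatenated two-iteration subpath -- evidence that an
    // unrolled-by-two version of the loop body would need no branch.
    static void alternating(int n) {
        for (int i = 0; i < n; i++) {
            if (i % 2 == 0) {
                evenStep(i);   // behavior of even iterations
            } else {
                oddStep(i);    // behavior of odd iterations
            }
        }
    }

    static void evenStep(int i) { /* work done in even iterations */ }
    static void oddStep(int i)  { /* work done in odd iterations */ }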
Acknowledgments We would like to thank Evelyn Duesterwald, Jim Larus, David Melski, Ran Shaham and Eran Yahav for their helpful comments and Alex Warshavski for his assistance in using Soot.
References

1. G. Ammons, T. Ball, and J. R. Larus. Exploiting hardware performance counters with flow and context sensitive profiling. ACM SIGPLAN Notices, 32(5):85–96, 1997. 91
2. M. Arnold and B. G. Ryder. A framework for reducing the cost of instrumented code. In SIGPLAN Conference on Programming Language Design and Implementation, pages 168–179, 2001. 92
3. V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimization system. In SIGPLAN Conference on Programming Language Design and Implementation, pages 1–12, 2001. 92
4. T. Ball and J. R. Larus. Efficient path profiling. In International Symposium on Microarchitecture, pages 46–57, 1996. 91
5. E. Berk and C. S. Ananian. JLex – A lexical analyzer generator for Java. Available at http://www.cs.princeton.edu/~appel/modern/java/JLex. 79, 80, 87
6. M. Burrows. Efficient and flexible value sampling. In Proceedings of the 9th Conference on Architectural Support for Programming Languages and Operating Systems, November 2000. 92
7. E. Duesterwald and V. Bala. Software profiling for hot path prediction: Less is more. In Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 202–211, 2000. 78, 91
8. P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the ACM SIGMOD, pages 331–342, 1998. 80, 81, 83, 85
9. JGF. The Java Grande Forum benchmark suite. Available at http://www.epcc.ed.ac.uk/javagrande. 80, 87
10. J. R. Larus. Whole program paths. In SIGPLAN Conference on Programming Language Design and Implementation, pages 256–269, 1999. 84, 91
11. D. Melski and T. W. Reps. Interprocedural path profiling. In International Conference on Compiler Construction, pages 47–62, 1999. 91
12. D. Oren. Online subpath profiling. Master's thesis, Tel-Aviv University, 2002. 83, 86, 87, 93
13. D. Oren, Y. Matias, and M. Sagiv. Online subpath profiling. Technical report, Tel Aviv University, 2002. 83, 86, 87, 93
14. S. Sastry, R. Bodik, and J. Smith. Rapid profiling via stratified sampling. In The 28th International Symposium on Computer Architecture, July 2001. 92
15. Sun. The Java2 Platform Standard Edition. Available at http://java.sun.com/j2se/1.3. 80, 87
16. O. Taub, S. Schechter, and M. D. Smith. Ephemeral instrumentation for lightweight program profiling. Technical report, Harvard University, 2000. 92
17. R. Vallee-Rai, E. Gagnon, L. J. Hendren, P. Lam, P. Pominville, and V. Sundaresan. Optimizing Java bytecode using the Soot framework: Is it feasible? In Proceedings of the International Conference on Compiler Construction, pages 18–34, 2000. 80, 85
18. J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37–57, 1985. 83
Precise Exception Semantics in Dynamic Compilation

Michael Gschwind and Erik Altman

IBM T.J. Watson Research Center, Yorktown Heights, NY 10598
Abstract. Maintaining precise exceptions is an important aspect of achieving full compatibility with a legacy architecture. While asynchronous exceptions can be deferred to an appropriate boundary in the code, synchronous exceptions must be taken when they occur. This introduces uncertainty into liveness analysis, since processor state that is otherwise dead may be exposed when an exception handler is invoked. Previous systems had to either sacrifice full compatibility to achieve more freedom to perform optimization, use less aggressive optimization, or rely on hardware support. In this work, we demonstrate how aggressive optimization can be used in conjunction with dynamic compilation without the need for specialized hardware. The approach is based on maintaining enough state to recompute the processor state when an unpredicted event such as a synchronous exception may make otherwise dead processor state visible. The transformations necessary to preserve precise exception capability can be performed in linear time.
1 Introduction
Dynamic compilation is a powerful technique to optimize programs based on execution behavior and to respond to changes in the execution profile. Dynamic optimization can be used either as a technique in its own right, or in combination with binary translation techniques. Dynamic optimization includes techniques to perform code layout for improved memory behavior, optimize frequently executed program paths, speculatively execute instructions or use value prediction [7,6,16,5,4,3,10]. A number of other optimization techniques are also highly effective in conjunction with dynamic optimization by exploiting runtime program profile data, such as dead code elimination, code sinking, unspeculation or partial redundancy elimination [9]. These techniques are even more useful for binary translation where the original ISA may cause the program to compute extraneous state which is hard to emulate and unnecessary, such as the computation of condition codes as a side effect of every instruction [11]. To produce correct execution behavior, dynamic optimization has to be conservative in analyzing and optimizing programs. In particular, the visible state of the program has to match the state of the unoptimized program at any point
during program execution. This requirement imposes significant restrictions on the types of optimizations which can be performed without impacting program correctness, because synchronous interrupts can expose parts of the state that are otherwise invisible. In the DAISY dynamic translation project we found that on a 4-wide machine running the SPECint95 benchmarks, ILP can be reduced by up to 18%, and by an average of 10%, by the requirement that every (possibly dead) result be placed in the architected register of the source architecture, and further that each result be placed in the architected register in order [2]. Consider the following code sequence:

    1  add r4,r3,r4
    2  lwz r3,0(r9)
    3  add r4,r3,r3
Clearly, the instruction at line 1 is dead, but a page fault caused by the load instruction at line 2 could make the value of r4 visible to the exception or signal handler. In many cases, the only action taken by the handler may be to store and restore the value in r4, but if the handler bases any actions on the values stored in register r4, the program may fail. Thus, many dynamic optimizers have severely restricted the amount of dead program state computation which can be eliminated [7]. Some dynamic optimizers have included a 'safe mode' which disables such optimizations [4,3], but this is undesirable since this approach (1) requires identifying which programs rely on extensive program state analysis in their exception handlers, and (2) penalizes such programs over their entire execution, even if no exception ever occurs. In this work, we present a solution that allows dead-state-eliminating techniques during dynamic optimization while retaining exact program behavior. In particular, this approach is based upon deferring materialization of otherwise dead code to the few instances where its results may be accessed by a synchronous exception handler. This is achieved by invoking a repair function provided by the dynamic optimizer environment which repairs the state of the program before actually passing control to a native synchronous exception handler (or its translation, in dynamic binary translation). Dynamic compilation is key to efficiently implementing this technique and taking full advantage of it with other optimizations. A static compiler faces an exponential growth in fixup code as operations are speculatively moved past multiple branches, while a dynamic compiler only generates these fragments in the (rare) event they are actually needed. This paper is structured as follows: we give an overview of the basic approach in Section 2. We present a sample algorithm for the elimination of dead code in Section 3, and discuss applications to other optimization techniques such as instruction scheduling and unspeculation in Section 4. We describe program state repair in Section 5. We present initial results in Section 6. We discuss related work in Section 7 and draw our conclusions in Section 8.
2 Basic Approach
The technique at the heart of this approach is annotation of generated code to allow a native exception handler to repair the state of the program to reflect the in-order state at any point in program execution where a synchronous exception can arise. Considering the example from the Introduction, we note that when an exception does not occur, eliminating instruction 1 would be a legal transformation. By introducing a repair step before the transfer of control to the exception handler, a legal code sequence can be achieved. Consider the following code, which has eliminated instruction 1, but annotated instruction 2 with repair actions to perform before control is passed to the exception handler:

    1  *** on exception, repair: r4 = r3+r4
    2  lwz r3,0(r9)
    3  add r4,r3,r3
Thus, when instruction 2 raises an exception, the repair actions will restore the value of r4 to that seen in the original program, but otherwise an instruction has been eliminated. This corresponds to the control flow graph (CFG) transformation in Figure 1, in which the exception handler is viewed as a branch target in the control flow graph, a view which is arguably correct. To make program transformations based on dead-state-eliminating techniques safe for use in dynamic optimization, several steps are necessary. During the optimization phase, enough information must be retained to regenerate eliminated state. This includes both information about the operations which were eliminated and preservation of the input values feeding them.
[Figure 1 (diagram): on the left, the original CFG, in which add r4,r3,r4 is followed by lwz r3,0(r9), whose likely edge leads to add r4,r3,r3 and whose unlikely edge leads to the exception handler. On the right, the transformed CFG: the dead add r4,r3,r4 has been removed from the likely path, and is executed only on the unlikely edge into the exception handler.]

Fig. 1. Control Flow Graph Transformation for Repair Code
When code is emitted, information about the eliminated computations has to be stored in the translation cache so it can later be used by the repair mechanism. And, finally, when an exception occurs, a repair function must interpret the information about eliminated state and recompute it so as to restore the entire program state before control is passed to a translation of the native exception handler.
3 Algorithm for Dead Code Elimination
Our algorithm for these optimizations is best demonstrated for the simplest case, dead code elimination. While dead code elimination is not very useful for a properly optimized program in the context of dynamic optimization, many optimizations can be reformulated as having precise exception semantics by leaving the original operations in place as dead operations computing values solely for the purpose of maintaining precise exception state. Dead code elimination can then be used to eliminate these operations. Also, dead code elimination is extremely useful when used in conjunction with binary translation, where it can be used to eliminate extraneous state introduced by the ISA, e.g., by condition code setting instructions in CISC ISAs such as the Intel x86 or IBM System/390 [11].

We will assume that the original program representation has been converted to an Internal Representation (IR) which has a single result value per operation. In the case of instructions with multiple results (such as compute and set condition code instructions), a machine instruction will be represented by multiple IR operations. We also assume that the IR is in SSA form. (Most dynamic compilers work on basic blocks or extended basic blocks, so this transformation is straightforward.)

Our algorithm uses a register equivalence list for liveness analysis and register allocation, to ensure that input values of eliminated instructions will be available if they are needed to compute the exact program state. We will denote a live-range register equivalence as s3 ≡ < s4, s7 >, indicating that at any point in the IR where symbolic register s3 is mentioned, symbolic registers s4 and s7 are to be considered live as well for the purpose of register allocation.

The algorithm iterates over an operation list representing a single translation group, and finds operations with a dead result. These operations can be eliminated, provided their result can be reconstructed in the event of an exception. To ensure this, the algorithm adds a use of the dead target symbolic register name after the instruction killing the result, and adds an entry to the register equivalence list which equates the dead result symbolic register to the symbolic input registers of the dead operation. (The use node represents the use along the exception control flow; since there are no instructions along that path, the uses along that arc can be folded into the mainline control flow at the conceptual control flow split at exception-raising instructions. The use node for the original register also serves a dual function: it represents the minimum range of validity of the repair note, and consequently how long the input registers need to be available. If some output rX of a repair note A rX = r1 OP r2 is required as input of another repair note B rZ = rX OP r3, this will extend the live range of rX and, because register live range equivalence is transitive, also of r1 and r2.) These two steps ensure that all input registers of deleted operations are live to the latest point where the target register may be live. Then, the dead instruction is removed and replaced by a repair note in the IR. (The difference between an actual instruction and a repair note can be a single bit flag field in the IR structure.) This yields the algorithm in Figure 2, which is linear (O(N), where N is the number of instructions in a CFG) in both time and the size of data structures.

    1  foreach operation op
    2    if dead ( target (op) )
    3      convert2repairnote (op); %% Deletes op and inserts as repair note
    4      foreach instruction killing target (op)
    5        insert use ( target (op) )
    6        insert equivalence ( target (op) == sources (op) )

    Fig. 2. Basic Algorithm

This algorithm can successfully deal in a single pass with a group of dead instructions which are dependent on each other, provided the liveness check at line 2 is transitive, i.e., a source register of any instruction is only live if its output is live. The transitive closure of live and dead values can be computed in a single backward sweep of the dependence graph, and hence is O(N).
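Such a backward sweep can be sketched as follows (our own illustration; the array-based IR encoding is an assumption, not the DAISY data structures):

    // One backward pass over a straight-line operation list: an operation
    // is dead iff its target is not live after it; sources of live
    // operations become live, giving the transitive closure in O(N).
    static boolean[] computeLive(int[][] sources, int[] targets,
                                 boolean[] liveAtExit) {
        int n = targets.length;
        boolean[] live = liveAtExit.clone(); // liveness at block exit
        boolean[] opLive = new boolean[n];
        for (int i = n - 1; i >= 0; i--) {
            opLive[i] = live[targets[i]];    // dead if result never used
            live[targets[i]] = false;        // the definition kills the register
            if (opLive[i]) {
                for (int s : sources[i]) {
                    live[s] = true;          // inputs of live ops are live
                }
            }
        }
        return opLive;
    }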
(a) PowerPC assembly code:

    and.  r4,r3,r4
    lwz   r3,0(r9)
    add   r4,r3,r3
    addi  r5,r3,80
    lwz   r3,0(r10)
    addi. r5,r3,1

(b) Initial intermediate representation:

    1  s4'   = s3 & s4
    2  sc0'  = (s3 & s4) cmp 0
    3  s3'   = [s9]
    4  s4''  = s3' + s3'
    5  s5'   = s3' + 80
    6  s3''  = [s10]
    7  s5''  = s3'' + 1
    8  sc0'' = (s3'' + 1) cmp 0

(c) Intermediate representation after annotation:

    1  { s4'   = s3 & s4 }
    2  { sc0'  = (s3 & s4) cmp 0 }
    3    s3'   = [s9]
    4    s4''  = s3' + s3'
         use s4'  ; s4' == < s3,s4 >
    5  { s5'   = s3' + 80 }
    6    s3''  = [s10]
    7    s5''  = s3'' + 1
         use s5'  ; s5' == < s3' >
    8    sc0'' = (s3'' + 1) cmp 0
         use sc0' ; sc0' == < s3,s4 >

Fig. 3. Example: PowerPC. Destination registers are at left. A "." after an operation means set condition register 0 by comparing the result to 0
Consider the operation of this algorithm on the PowerPC code sequence in Figure 3(a), and recall that excepting operations such as loads represent control flow points for our purposes. The initial IR after SSA conversion for this code is in Figure 3(b). The first operation (s4' = s3 & s4) is dead after IR Op 4 (s4'' = s3' + s3'), so as shown in Figure 3(c), a use of s4' is inserted and s4' ≡ < s3, s4 >. IR Op 2 results from the fact that the PowerPC operation and. sets condition register 0. The value in condition register 0 is dead at IR Op 8, so we insert a use of sc0' after Op 8, and sc0' ≡ < s3, s4 >, as can be seen in Figure 3(c). Finally, IR Op 5 is dead at IR Op 7, so we insert a use of s5' after Op 7, and s5' ≡ < s3' >, as can again be seen in Figure 3(c).

When this code is converted into the target assembly code, register allocation will be performed on the symbolic registers, and a register map table is generated describing how to reload physical registers from the translation to achieve the original state [2]. For eliminated registers, this table will contain the names and formulae of the symbolic registers. To reduce storage requirements, side tables may also be dynamically recomputed when an exception occurs [15].

Note that this algorithm is overly conservative, because repair is not necessary if no instruction can trigger a synchronous exception between the point of the original instruction and the point where its result is killed. Also, repair needs to be possible only up to the last instruction which can cause a synchronous exception. Thus, as shown in Figure 4, we can reformulate the algorithm to insert the use operator to keep a value alive only up to the last possible exception point. Dead values whose live range does not span an exception point are not backstopped by a 'use' node and will be deleted by a subsequent dead code elimination pass, unless they are needed to feed a repair note which may be evaluated to reconstruct the precise exception state. This algorithm takes O(N²) time, where N is again the number of instructions in a CFG.

As mentioned earlier, we assume that the CFG is an extended basic block with a single entry point and multiple exits. Figure 5 illustrates a CFG for an extended basic block with P = 5 paths. Each path in the CFG is traversed in a depth-first manner, keeping track of (1) the last excepting operation on a path and (2) the last instruction on a path to write each register, as depicted in our final recursive algorithm in Figure 6. To make this point clearer, consider the (extended basic block) CFG in Figure 5. Taking the leftmost path, P1, instruction 1 is first encountered. It writes to register r4. A bit later on this path, instruction 3 can raise a synchronous exception, as noted by the E. Finally, at the end of this path, instruction 5 writes to register r4, thus killing (on this path) the result computed by instruction 1. If instruction 1 is dead on all paths, then the algorithm:

– Notes that instruction 3 — the last excepting op — represents a potential use of r4.
– Saves the information needed to compute r4 if an exception does occur at instruction 3.
– Converts the killed ins, instruction 1, to a repair note and deletes it.
    foreach operation OP
      if dead ( target (OP) )
        insert equivalence ( target (OP) == sources (OP) )
        repair ever := FALSE;
        for all paths p starting at OP
          repair path := FALSE
          for all operations I on path p
            if operation I can cause synchronous exception
              repair ever := TRUE;
              repair path := TRUE;
              last excepting op := I;
            if operation I kills target (OP) && repair path
              insert use ( target (OP), last excepting op )
              next path;
        convert2repairnote (OP);

Fig. 4. Algorithm augmented so as to avoid repair notes if they are not needed
[Figure 5 (diagram): an extended basic block with a single entry, twelve instructions numbered in depth-first order (dfn), and five exit paths labeled P1–P5. Instructions marked E can cause synchronous exceptions; instructions marked r4= write register r4.]

Fig. 5. CFG for Extended Basic Block of code with single entry and multiple exits
    final (OP, prev_writer, last_excepting_op) {
      if (!OP) {   // Handle end of recursion
        forall src {
          first_use[src].op = NULL;
          first_use[src].intervening_exception = NONE;
        }
        return first_use;
      }
      if operation OP can cause synchronous exception
        last_excepting_op := OP;
      curr_result_reg := target(OP)
      killed_ins := prev_writer[curr_result_reg];
      prev_writer[curr_result_reg] := OP;
      if killed_ins != NONE {
        if (dead (killed_ins)) {
          // it is dead along all paths; computation can be removed totally
          insert_equivalence ( target(killed_ins) == sources(killed_ins))
          convert2repairnote(killed_ins);
          if (dfn[last_excepting_op] >= dfn[killed_ins]) {
            insert_use (target(killed_ins), last_excepting_op)
          } else {
            set_candidate_for_delete(killed_ins);
          }
        } else {
          // instruction is live among some paths, but dead on current path
          // candidate for code sinking (PRE), will be performed below
        }
      }
      if ! branch (OP) {
        first_use = final (OP->left, prev_writer, last_excepting_op)
      } else {
        first_use_left  = final (OP->left,  prev_writer, last_excepting_op)
        first_use_right = final (OP->right, prev_writer, last_excepting_op)
        // register-wise combination on control flow splits
        first_use = combine (first_use_left, first_use_right)
      }
      // perform sinking if possible, inserting repair note if necessary
      push_op_down(OP, first_use[curr_result_reg].op);
      if (first_use[curr_result_reg].intervening_exception) {
        insert_use (target(OP), first_use[curr_result_reg].intervening_exception)
        insert_equivalence ( target(OP) == sources(OP))
        convert2repairnote(OP);
      }
      forall src in sources(OP) {
        first_use[src].op = OP;
        first_use[src].intervening_exception = NONE;
      }
      if operation OP can cause synchronous exception
        forall regnames defined in architecture
          if first_use[src].intervening_exception == NONE
            first_use[src].intervening_exception = OP;
      return first_use;
    }

Fig. 6. Final algorithm
Similar actions occur on path P4. When instruction 4 is encountered, it writes to register r4 and hence kills the result of instruction 1. However, no excepting instructions have been encountered on this path, hence no repair note need be added. Continuing down path P4, instruction 10 is noted as last excepting op. At instruction 11, register r4 is written, thus killing the result computed at instruction 7. If instruction 7 is dead on all paths, then 3 steps akin to those above on path P1 are performed.

Note that the algorithm in Figure 6 uses dfn — a depth-first numbering of nodes — to determine whether an excepting operation has occurred between an operation and its killer. This dfn represents the relative position of each instruction on a path, and is monotonically increasing from the start to the end of any path. For example, on path P1 in Figure 5, the instructions' dfn's are 1, 2, 3, 4, 5, while on path P4 they are 1, 2, 7, 9, 10, 11.

The algorithm described here uses recursive descent to visit each node in the control flow graph in depth-first order. The bottom half of this algorithm performs code sinking (partial redundancy elimination) on the upward pass of the recursive descent. Each node is visited twice (during the downward and the upward pass), so we posit that the algorithm is O(N). For each register name, the first use following the current op is maintained in first_use. On control flow splits, data from both paths is combined. The combine function propagates upward the first_use of a register if it is only used along a single path, or defines the control flow split as the first use if the register is used along both paths; a sketch of this combination step appears below. (Other types of combine operations are possible, but lead to code duplication. This is a trade-off which could make good use of profile data available in a dynamic compilation system.)

This algorithm can be further extended to consider register pressure when making optimization decisions, since in some circumstances the optimization technique presented here can extend two live ranges to eliminate one dead live range, thereby increasing register pressure and forcing the register allocator to spill registers to memory.
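One possible reading of the combine step described above, written as code (our own sketch with hypothetical type names, not the actual implementation):

    // Op stands for an IR operation node; FirstUse records, per register,
    // the first op using it and the first excepting op before that use.
    class Op { /* IR operation node (stub) */ }

    final class FirstUse {
        Op op;                    // first op using the register, or null
        Op interveningException;  // first excepting op before that use, or null
    }

    final class Combiner {
        // Register-wise combination at a control-flow split: propagate a
        // use seen on only one path; if both paths use the register, the
        // split itself becomes the first use.
        static FirstUse[] combine(FirstUse[] left, FirstUse[] right, Op split) {
            FirstUse[] out = new FirstUse[left.length];
            for (int r = 0; r < left.length; r++) {
                FirstUse f = new FirstUse();
                if (left[r].op != null && right[r].op != null) {
                    f.op = split;          // used along both paths
                } else if (left[r].op != null) {
                    f.op = left[r].op;     // used along the left path only
                } else {
                    f.op = right[r].op;    // right path only, or null if unused
                }
                // an exception on either path can expose the register
                f.interveningException = (left[r].interveningException != null)
                        ? left[r].interveningException
                        : right[r].interveningException;
                out[r] = f;
            }
            return out;
        }
    }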
    1  { s4'   = s3 & s4 }
    2  { sc0'  = (s3 & s4) cmp 0 }
    3    s3'   = [s9]
         use s4'  ; s4' == < s3, s4 >
    4    s4''  = s3' + s3'
    5  { s5'   = s3' + 80 }
    6    s3''  = [s10]
         use s5'  ; s5' == < s3' >
         use sc0' ; sc0' == < s3, s4 >
    7    s5''  = s3'' + 1
    8    sc0'' = (s3'' + 1) cmp 0

Fig. 7. Reduced Live Range of Repair
Applying the modified algorithm to the example, the live range of repair notes is reduced as can be seen by comparing Figure 7 to Figure 3(c). However, none are actually eliminated in this particular example.
4 Other Optimizations
The algorithm presented in the previous section can be adapted trivially to schedule instructions later than their original schedule (code sinking). A repair note is then inserted in the original instruction slot. Note that no special provisions have to be made to preserve the input values of the repair note, since they are also an input to the rescheduled instruction:
    foreach operation op
      if schedule below ( op )
        %% Deletes op and inserts as repair note.
        convert2repairnote (op);

Unspeculation (partial redundancy elimination) can be handled by a combination of dead code elimination along paths where a computation is redundant, and code sinking for those paths where the instruction is needed. A similar approach can also be applied to other optimizations, such as constant propagation, constant folding and commoning, where the original code becomes dead and is treated as described in Section 3. An approach based on repairing state can also be used to eliminate memory operations if disambiguation is possible at dynamic compile time. However, this is only possible in a uniprocessor context, as multiprocessor configurations may introduce additional producers and consumers for memory values which cannot be adequately analyzed. When performing instruction scheduling during dynamic optimization, state repair can also be used to achieve precise exceptions. We give a list scheduling algorithm modified to incorporate state repair for achieving precise exception semantics:
    do {
      ready_ins := initially_ready(CFG);
      ins := select_ins(ready_ins);
      if (ins can cause exception) {
        predecessors := predecessors (ins, CFG);
        issue_repair_notes_from_list (predecessors);
      }
      ready_ins := ready_ins UNION successors(ins, CFG)
    } until (ready_ins = EMPTY_SET)
5 Repair Handler
Since repair notes are rarely evaluated (only on synchronous exceptions), no actual code is generated for them. Instead, the repair notes are stored in compact form in main memory, and interpreted by an interpretive evaluator on demand. Thus, the cost of repair notes consists of the time penalty incurred when entering the exception handler to interpret the repair notes associated with the current instruction group, and the space cost of storing the repair notes and the interpretive evaluator. When a synchronous exception occurs, control first passes to the repair handler. To compute the entire program state, the repair handler sequentially evaluates all repair notes in a single forward sweep. Then, all registers are assigned to their “home locations” (typically, the identity mapping) before control is transferred to the translation of the exception handler.
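To illustrate what such an interpretive evaluator might look like (our own sketch with hypothetical data structures, not DAISY's actual repair-note format), a repair note can record a target architected register together with an operation over saved source registers:

    // Each note records how to recompute one dead architected register
    // from registers that are still live at the excepting instruction.
    final class RepairNote {
        final int targetReg;  // architected register to restore
        final char op;        // '&', '+', '^' (illustrative operation set)
        final int srcA;       // physical register holding the first input
        final int srcB;       // physical register holding the second input

        RepairNote(int targetReg, char op, int srcA, int srcB) {
            this.targetReg = targetReg;
            this.op = op;
            this.srcA = srcA;
            this.srcB = srcB;
        }
    }

    final class RepairHandlerSketch {
        // Evaluate all notes attached to the excepting group in a single
        // forward sweep, writing the recomputed values into the architected
        // state before control reaches the exception handler.
        static void repair(RepairNote[] notes, int[] physRegs, int[] archRegs) {
            for (RepairNote n : notes) {
                int a = physRegs[n.srcA];
                int b = physRegs[n.srcB];
                switch (n.op) {
                    case '&': archRegs[n.targetReg] = a & b; break;
                    case '+': archRegs[n.targetReg] = a + b; break;
                    case '^': archRegs[n.targetReg] = a ^ b; break;
                    default: throw new IllegalStateException("unknown op");
                }
            }
        }
    }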
    [Initially, all registers are assumed to be in their home locations]

    0x00  lwz   R32, 0(R9)    [ r3 := R32 ]
    0x04  add   R3, R32, R32  [ r4 := R3 ]
    0x08  lwz   R33, 0(R10)   [ r3 := R33 ]
    0x0C  addi  R5,R33,1      [ -unchanged- ]
    0x10  cmpi  CR0,R5,0      [ -unchanged- ]

Fig. 8. Annotations mapping physical to architected registers
Consider again the previous code example, which may have been assembled into the PowerPC code fragment in Figure 8. Because the algorithm did not consider register pressure, the optimized code fragment requires more than the original number of registers, leading to register numbers greater than R31. It is desirable to utilize available registers if the target architecture has more registers than the source architecture, but could lead to performance degradation otherwise, making consideration of register pressure an important aspect. Register mappings are updated incrementally, and indicated after each assembly instruction, as shown in Figure 8. Target architecture registers are indicated by capitalized register names. Figure 9 shows the repair notes stored for the code fragment in Figure 8. Note that to reduce the number of bits necessary for storing the symbolic registers associated with repair notes, their (separate) name space can also be allocated using coloring, as is done in Figure 9.
    S0  = R3 & R4
    SC0 = (R3 & R4) cmp 0
    0x00: [ r4 := S0; cr0 := SC0 ]

    S0  = R3 + 80
    0x08: [ r5 := S0; cr0 := SC0 ]
Fig. 9. Repair Notes
6 Results
To evaluate the performance potential of dead code elimination and code sinking in dynamic optimization environments, we used the DAISY environment to measure the optimization opportunity. This evaluation was performed for two systems, IBM PowerPC and IBM System/390. To gauge the performance opportunity, we applied the algorithm in Figure 6 to the DAISY group intermediate representation to determine the number of intermediate operations that can be eliminated from the execution path. The intermediate operations are defined as having a single destination and a variable number of inputs. PowerPC and System/390 instructions requiring multiple destinations were cracked into a sequence of simpler instructions.

Figure 10 shows the number of intermediate operations eliminated compared to the case when optimization is retarded by conservative assumptions about excepting instructions. This number of operations directly reflects the number of primitive operations which must be executed on a VLIW platform (such as BOA [10]). The number similarly reflects the number of simple micro-ops that would be executed in a layered instruction set implementation of a superscalar. The percentage of IR operations which can be eliminated in the benchmarks presented has been computed for two different system operation points of the DAISY dynamic compilation system. These correspond to aggressive and conservative ILP extraction policies, labelled (a) and (c) respectively for each of the benchmarks in Figure 10. For PowerPC code almost 5% of primitive operations are removed on average in the aggressive case, compared to only about 3% in the conservative case. For reasons explained below, more primitive operations are removed for System/390 code: 12% and 10% on average, respectively, for the aggressive and conservative cases.

The code which was analyzed here was compiled with high optimization levels to mirror typical tuned SPEC code. Hence, any optimization opportunity found here is over and above what a state-of-the-art offline compiler can achieve. In particular for System/390, an additional improvement is achieved by eliminating the computation of dead condition codes, which are nearly always set as a byproduct of System/390 arithmetic instructions. Since emulating the condition codes of one architecture on another often requires a long sequence of instructions, eliminating these computations is particularly important for achieving good performance in system emulation [10].
[Figure 10 (bar charts): percentage of primitive IR operations eliminated (y-axis, 0–25%) under the aggressive (a) and conservative (c) policies. Top chart, IBM PowerPC: compress, gcc, go, ijpeg, li, m88ksim, perl, tpcc, vortex. Bottom chart, IBM System/390: gcc, go, ijpeg, li, m88ksim, perl, system.]

Fig. 10. Dead code elimination opportunities for IBM PowerPC (top) and IBM System/390 (bottom)
Differences between the aggressive and conservative ILP extraction policies were the result of several factors: the thresholds used for determining when groups are extended, the maximum allowable group size, and the infinite resource ILP target [8]. Aggressive group formation policy generates larger instruction groups in an effort to extract more ILP from the code. As expected, larger group size did lead to more opportunity for dead code elimination, since all registers must be considered live on group transitions.
7 Related Work
While early work on dynamic compilation was concerned mostly with reducing the overhead per translated instruction, the applicability of more aggressive optimizations has become an issue in more recent work.

Special-purpose optimizations for deferring the full materialization of condition codes have been performed in previous architecture emulation systems, such as Wabi [13]. However, this type of deferred materialization has usually required that all source values be copied to defined storage locations for later materialization. This incurred significantly more overhead than the present approach, but was a significant performance improvement compared to full instantiation of condition codes, which usually requires quite complex operations to match the semantics of the emulated architecture.

DAISY explores the use of aggressive ILP optimizations in dynamic binary translation [7,6,11,8,1]. DAISY uses aggressive speculation, but performs in-order commit operations to the emulated processor state to achieve precise exceptions. DAISY exploits the atomic nature of VLIW instructions in the target architecture to perform dead code elimination in the scope of a single long instruction word.

The DYNAMO dynamic optimization system performs dynamic optimization on HP-PA binaries, with the target being the HP-PA instruction set [4,3]. DYNAMO allows for aggressive optimizations, but uses program annotation or a user-selectable conservative optimization mode to deal with binaries where precise exception behavior is an issue.

The Transmeta binary translation system for Intel x86 code [14] and the BOA system for IBM PowerPC code [10] use a hardware rollback/commit scheme to ensure precise exception behavior. The entire processor state is checkpointed at the entry of each translation group. When an exception is raised, the entire processor state is rolled back to the translation fragment entry state, and then the interpreter interprets instructions sequentially to compute all processor state.

Le [15] and Altman et al. [2] show how a repair mechanism can be used to reduce the cost of register allocation in binary translation.

Maintaining full program state for exception handling is related to the problem of presenting the full program state of optimized programs to debuggers. In both cases, otherwise unused state which is not being computed by the optimized program may be accessed [12].
The constraints for presenting program state in a debugger are different, since it may be acceptable to devote more time to both the compilation and state recovery process. On the other hand, an optimizing compiler must be able to deal with a state query at arbitrary points, whereas a dynamic compilation system is aware that such queries are restricted to those points where an instruction can raise a synchronous exception. In fact, debuggers running under a dynamic compilation system present an interesting mix of these dual requirements. In DAISY, we solve this by detecting code modification or similar events (due to the setting of a breakpoint), and can dynamically recompile the affected code.
8 Conclusion
Maintaining precise exceptions is an important aspect of achieving full compatibility with a legacy architecture. While asynchronous exceptions can be deferred to an appropriate boundary in the code, synchronous exceptions must be taken when they occur. This introduces uncertainty into the liveness analysis, since otherwise dead processor state may be exposed when an exception handler is invoked. Previous systems had to either sacrifice full compatibility to achieve more freedom to perform optimization, use less aggressive optimization, or rely on hardware support. In this work, we have demonstrated how aggressive optimization can be used in conjunction with dynamic compilation without the need for specialized hardware. The approach is based on maintaining enough state to recompute the processor state when an unpredicted event such as a synchronous exception may make otherwise dead processor state visible. The transformations necessary to preserve precise exception capability can be performed in linear time.
References

1. E. Altman and K. Ebcioğlu. Simulation and debugging of full system binary translation. In Proc. of the 13th International Conference on Parallel and Distributed Computing Systems, pages 446–453, Las Vegas, NV, August 2000. 108
2. E. Altman, K. Ebcioğlu, M. Gschwind, and S. Sathaye. Efficient instruction scheduling with precise exceptions. In preparation. 96, 100, 108
3. V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimization system. SIGPLAN PLDI, pages 1–12, June 18–21, 2000, Vancouver, BC, June 2000. 95, 96, 108
4. V. Bala, E. Duesterwald, and S. Banerjia. Transparent dynamic optimization: The design and implementation of Dynamo. Technical Report 99-78, HP Laboratories, Cambridge, MA, June 1999. 95, 96, 108
5. H. Chung, S.-M. Moon, and K. Ebcioğlu. Using value locality on VLIW machines through dynamic compilation. In Proc. of the 1999 Workshop on Binary Translation, IEEE Computer Society Technical Committee on Computer Architecture Newsletter, pages 69–76, December 1999. 95
6. K. Ebcioğlu and E. Altman. DAISY: dynamic compilation for 100% architectural compatibility. In Proc. of the 24th Annual International Symposium on Computer Architecture, pages 26–37, Denver, CO, June 1997. ACM. 95, 108
7. K. Ebcioğlu and E. Altman. DAISY: dynamic compilation for 100% architectural compatibility. Research Report RC20538, IBM T. J. Watson Research Center, Yorktown Heights, NY, 1996. 95, 96, 108
8. K. Ebcioğlu, E. Altman, S. Sathaye, and M. Gschwind. Execution-based scheduling for VLIW architectures. In Euro-Par '99 Parallel Processing – 5th International Euro-Par Conference, number 1685 in Lecture Notes in Computer Science, pages 1269–1280. Springer Verlag, Berlin, Germany, August 1999. 108
9. K. Ebcioğlu, R. Groves, K. Kim, and G. Silberman. VLIW compilation techniques in a superscalar environment. In Proc. of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation, volume 29 of SIGPLAN Notices, pages 36–48, Orlando, FL, June 1994. ACM. 95
10. M. Gschwind, E. Altman, S. Sathaye, P. Ledak, and D. Appenzeller. Dynamic and transparent binary translation. IEEE Computer, 33(3):54–59, March 2000. 95, 106, 108
11. M. Gschwind, K. Ebcioğlu, E. Altman, and S. Sathaye. Binary translation and architecture convergence issues for IBM System/390. In Proc. of the International Conference on Supercomputing 2000, Santa Fe, NM, May 2000. ACM. 95, 98, 108
12. J. Hennessey. Symbolic Debugging of Optimized Code. ACM Transactions on Programming Languages and Systems, July 1982, Volume 4, Issue 3, pages 323–344, ACM Press. 108
13. P. Hohensee, M. Myszewski, and D. Reese. WABI CPU emulation. In Hot Chips VIII, Palo Alto, CA, 1996. 108
14. E. Kelly, R. Cmelik, and M. Wing. Memory controller for a microprocessor for detecting a failure of speculation on the physical nature of a component being addressed. US Patent 5832205, November 1998. 108
15. B. Le. An out of order execution technique for runtime binary translators. In Proc. of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, volume 33 of SIGPLAN Notices, pages 151–158, San Jose, CA, 1998. ACM. 100, 108
16. S. Sathaye, P. Ledak, J. LeBlanc, S. Kosonocky, M. Gschwind, J. Fritts, Z. Filan, A. Bright, D. Appenzeller, E. Altman, and C. Agricola. BOA: Targeting multigigahertz with binary translation. In Proc. of the 1999 Workshop on Binary Translation, IEEE Computer Society Technical Committee on Computer Architecture Newsletter, pages 2–11, December 1999. 95
Decompiling Java Bytecode: Problems, Traps and Pitfalls

Jerome Miecznikowski and Laurie Hendren

Sable Research Group, School of Computer Science, McGill University
{jerome,hendren}@cs.mcgill.ca
Abstract. Java virtual machines execute Java bytecode instructions. Since this bytecode is a higher level representation than traditional object code, it is possible to decompile it back to Java source. Many such decompilers have been developed and the conventional wisdom is that decompiling Java bytecode is relatively simple. This may be true when decompiling bytecode produced directly from a specific compiler, most often Sun’s javac compiler. In this case it is really a matter of inverting a known compilation strategy. However, there are many problems, traps and pitfalls when decompiling arbitrary verifiable Java bytecode. Such bytecode could be produced by other Java compilers, Java bytecode optimizers or Java bytecode obfuscators. Java bytecode can also be produced by compilers for other languages, including Haskell, Eiffel, ML, Ada and Fortran. These compilers often use very different code generation strategies from javac. This paper outlines the problems and solutions we have found in our development of Dava, a decompiler for arbitrary Java bytecode. We first outline the problems in assigning types to variables and literals, and the problems due to expression evaluation on the Java stack. Then, we look at finding structured control flow with a particular emphasis on issues related to Java exceptions and synchronized blocks. Throughout the paper we provide small examples which are not properly decompiled by commonly used decompilers.
1 Introduction
Java bytecode is a stack-based program representation executed by Java virtual machines. It was originally designed as the target platform for Java compilers. Java bytecode is a much richer and higher-level representation than traditional low-level object code. For example, it contains complete type signatures for methods and method invocations. The high-level nature of bytecode makes it reasonable to expect that it can be decompiled back to Java; all of the necessary information is contained in the bytecode. The design of such a decompiler is made easier if it only decompiles bytecode produced by specific compilers, for example the popular javac available with Sun's JDKs. In this case the problem is mostly one of inverting a known compilation strategy. The design of a decompiler is also simplified if it does not need to determine the exact types of
all variables, but instead inserts spurious type casts to “fix up” code that has unknown type. We solve a more difficult problem, that of decompiling arbitrary, verifiable bytecode. In addition to handling arbitrary bytecode, we also try to ensure that the decompiled code can be compiled by a Java compiler and that the code does not contain extraneous type casts or spurious control structures. Such a decompiler can be used to decompile bytecode that comes from many sources including: (1) bytecode from javac; (2) bytecode that has been produced by compilers for other languages, including Ada, ML, Eiffel and Scheme; or (3) bytecode that has been produced by bytecode optimizers. Code from these last two categories may cause decompilers to fail because they were designed to work specifically with bytecode produced by javac and cannot handle bytecode that does not fit specific patterns.

To achieve our goal, we are developing a decompiler called Dava, based on the Soot bytecode optimization framework. In this paper we outline the major problems that we faced while developing the decompiler. We present many of the major difficulties, discuss what makes the problems difficult, and demonstrate that other commonly used decompilers fail to handle these problems properly.

Section 2 of this paper describes the problems in decompiling variables, types, literals, expressions and simple statements. Section 3 introduces the problem of converting arbitrary control flow found in bytecode to the control flow constructs available in Java. Section 4 discusses the basic control flow constructions, while the specific problems due to exceptions and synchronized blocks are examined in more detail in Section 5. Related work and conclusions are given in Section 6.
2 Variables, Types, Literals, Expressions and Simple Statements
In order to illustrate the basic challenges in decompiling variables and their types, consider the simple Java program in Figure 1(a), page 114. Classes Circle and Rectangle define circle and rectangle objects. Both of these classes implement the Drawable interface, which specifies that any class implementing it must include a draw method. To illustrate the similarities and differences between the Java representation and the bytecode representation, focus on method f in class Main. Figure 1(b) gives the bytecode generated by javac for this method.

2.1 Variables, Literals and Types
First consider the names and signatures of methods. All of the key information for methods originally from Java source is completely encoded in the bytecode. Both the method names and the type signatures are available for the method declarations and all method invocations. However, the situation for variables is quite different.
In the Java source each variable has a name and a static type which is valid for all uses and definitions of that variable. In the bytecode there are only untyped locations — in method f there are 4 stack locations and 5 local locations. The stack locations are used for the expression stack, while the local locations are used to store parameters and local variables. In this particular example, the javac compiler has mapped the parameter i to local 0, and the four local variables c, r, d and is_fat to locals 1, 2, 3 and 4 respectively. The mapping of offsets to variable names and the types of variables must be inferred by the decompiler.

Another complicating factor in decompiling bytecode is that while Java supports several integral data types, including boolean, char, short and int, at the bytecode level the distinction between these types is made only in the signatures for methods and fields. Otherwise, bytecode instructions treat these types as integers. For example, at Label2 in Figure 1(b) the instruction iload 4 loads an integer value for is_fat, which at line 16 in Figure 1(a) is a boolean value in the Java program. This mismatch between the many integral types in Java and the single integer type in bytecode presents several challenges for decompiling.

These difficulties are illustrated by the result of applying several commonly used decompilers. Figure 2 shows the output from three popular decompilers, plus the output from our decompiler, Dava. Jasmine (also known as the SourceTec Java Decompiler) is an improved version of Mocha, probably the first publicly available decompiler [10,7]. Jad is a decompiler that is free for non-commercial use and whose decompilation module has been integrated into several graphical user interfaces including FrontEnd Plus, Decafe Pro, DJ Java Decompiler and Cavaj [6]. Wingdis is a commercial product sold by WingSoft [16]. In our later examples we also include results from SourceAgain, a commercial product that has a web-based demo version [14].1 Our tests used the most current releases of the software available at the time of writing this paper, namely Jasmine version 1.10, Jad version 1.5.8, Wingdis version 2.16, and SourceAgain version 1.1.

Each of the results illustrates a different approach to typing local variables. In all cases the variables with types boolean, Circle and Rectangle are correct. The major difficulty is in inferring the type for variable d in the original program, which should have type Drawable. The basic problem is that on one control path d is assigned an object of type Circle, whereas on the other, d is assigned an object of type Rectangle. The decompiler must find a type that is consistent with both assignments, and with the use of d in the statement d.draw();. The simplest approach is to always choose the type Object in the case of conflicting constraints. Figure 2(a) shows that Jasmine uses this approach. This produces incorrect Java in the final line, where the variable object needs to be cast to a Drawable. Jad correctly inserts this cast in Figure 2(c). Wingdis exhibits a bug on this example, producing no variable for the original d, and incorrectly emitting a static call Drawable.draw();.
1 The demo version does not support typing across several class files, so it is not included in our first figure.
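To see the integral-type collapse concretely, consider the following two methods (our own illustration, not from the paper's benchmark set). Under javac they compile to identical instruction sequences; only the method descriptors (()Z versus ()I) record the distinction:

    static boolean t() { return true; }   // compiles to: iconst_1; ireturn
    static int     u() { return 1;    }   // compiles to: iconst_1; ireturn

A decompiler therefore cannot recover boolean versus int for a local variable from the instructions alone; it must propagate type information from method and field signatures through the method body.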
public class Circle implements Drawable {
    public int radius;
    public Circle(int r) { radius = r; }
    public boolean isFat() { return(false); }
    public void draw() {
        // code to draw ...
    }
}

public class Rectangle implements Drawable {
    public short height, width;
    public Rectangle(short h, short w) { height = h; width = w; }
    public boolean isFat() { return(width > height); }
    public void draw() {
        // code to draw ...
    }
}

public interface Drawable {
    public void draw();
}

public class Main {
    public static void f(short i) {
        Circle c;
        Rectangle r;
        Drawable d;
        boolean is_fat;
        if (i > 10)                    // 6
        {   r = new Rectangle(i, i);   // 7
            is_fat = r.isFat();        // 8
            d = r;                     // 9
        }
        else
        {   c = new Circle(i);         // 12
            is_fat = c.isFat();        // 13
            d = c;                     // 14
        }
        if (!is_fat) d.draw();         // 16
    }                                  // 17

    public static void main(String args[]) {
        f((short) 11);
    }
}
(a) Original Java Source
.method public static f(S)V
.limit stack 4
.limit locals 5
.line 6
    iload_0
    bipush 10
    if_icmple Label1
.line 7
    new Rectangle
    dup
    iload_0
    iload_0
    invokenonvirtual Rectangle/<init>(SS)V
    astore_2
.line 8
    aload_2
    invokevirtual Rectangle/isFat()Z
    istore 4
.line 9
    aload_2
    astore_3
    goto Label2
.line 12
Label1:
    new Circle
    dup
    iload_0
    invokenonvirtual Circle/<init>(I)V
    astore_1
.line 13
    aload_1
    invokevirtual Circle/isFat()Z
    istore 4
.line 14
    aload_1
    astore_3
.line 16
Label2:
    iload 4
    ifne Label3
    aload_3
    invokeinterface Drawable/draw()V 1
.line 17
Label3:
    return
.end method

(b) bytecode for method f
Fig. 1. Example program source and bytecode generated by javac

As shown in Figure 2(d), our decompiler correctly types all the variables and does not require a spurious cast to Drawable. The complete typing algorithm is presented in our paper entitled "Efficient Inference of Static Types for Java Bytecode" [5]. The basic idea is to construct a graph encoding type constraints. The graph contains hard nodes representing the types of classes, interfaces, and the base types; and soft nodes representing the variables. Edges in the graph are inserted for all constraints that must be satisfied by a legal typing. For example, the statement d.draw(); would insert an edge from the soft node for d to the hard node for Drawable. Once the graph has been created, typing is performed by collapsing nodes in the graph until all soft nodes have been associated with hard nodes. In this case the soft node for d would be collapsed into the hard node for Drawable. There do exist bytecode programs that cannot be statically typed, and for those programs we resort to assigning types that are too general and inserting down casts where necessary.
public static void f(short s)
{
    Object object;
    boolean flag;
    if (s > 10)
    {
        Rectangle rectangle = new Rectangle(s, s);
        flag = rectangle.isFat();
        object = rectangle;
    }
    else
    {
        Circle circle = new Circle(s);
        flag = circle.isFat();
        object = circle;
    }
    if (!flag)
        object.draw();
}

(a) Jasmine

public static void f(short word0)
{
    Object obj;
    boolean flag;
    if (word0 > 10)
    {
        Rectangle rectangle = new Rectangle(word0, word0);
        flag = rectangle.isFat();
        obj = rectangle;
    }
    else
    {
        Circle circle = new Circle(word0);
        flag = circle.isFat();
        obj = circle;
    }
    if(!flag)
        ((Drawable) (obj)).draw();
}

(c) Jad
public static void f(short short0)
{
    boolean boolea4;
    if (((byte)short0) <= 10)
    {
        Circle circle1= new Circle(short0);
        boolea4= circle1.isFat();
    }
    else
    {
        Rectangle rectan2= new Rectangle(((short)short0), ((short)short0));
        boolea4= rectan2.isFat();
    }
    if (boolea4 == 0)
        Drawable.draw();
}

(b) Wingdis

public static void f(short s0)
{
    boolean z0;
    Rectangle r0;
    Drawable r1;
    Circle r2;
    if (s0 <= 10)
    {
        r2 = new Circle(s0);
        z0 = r2.isFat();
        r1 = r2;
    }
    else
    {
        r0 = new Rectangle(s0, s0);
        z0 = r0.isFat();
        r1 = r0;
    }
    if (z0 == false)
        r1.draw();
    return;
}

(d) Dava
Fig. 2. Decompiled code for method f

However, we have found very few cases where such casts need to be inserted, and in general our approach leads to many fewer casts than simpler typing algorithms.

The decompiled code produced by Wingdis, Figure 2(b), demonstrates the difficulties produced by different integral types. This decompiler inserts spurious typecasts for all uses of the variable short0. Furthermore, constants as well as variables must be assigned the correct integral type. For example, a call to method f with a constant value must be made as f((short) 10); in order to avoid a type conflict between the type of the argument (int) and the type of the parameter (short).
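Before leaving the typing problem, here is our own rough sketch of the constraint-graph idea described above. The class, the string-based node encoding, and the Hierarchy oracle are our assumptions for illustration, not Dava's actual data structures, and the real algorithm also distinguishes assignment (lower-bound) edges from use (upper-bound) edges, which this sketch conflates:

    import java.util.*;

    // Soft nodes (variables) collect edges to hard nodes (known types);
    // each soft node is then collapsed onto a hard type consistent with
    // all of its constraints.
    final class TypeSolver {
        // Hypothetical subtype oracle; assumed reflexive:
        // isSubtype("Circle", "Drawable") == true.
        interface Hierarchy { boolean isSubtype(String sub, String sup); }

        private final Map<String, Set<String>> edges = new HashMap<>();

        void constrain(String var, String type) {
            edges.computeIfAbsent(var, k -> new HashSet<>()).add(type);
        }

        // Pick a constraint type that every other constraint type is a
        // subtype of (simplified: assumes such a candidate exists).
        String solve(String var, Hierarchy h) {
            Set<String> ts = edges.get(var);
            outer:
            for (String cand : ts) {
                for (String t : ts)
                    if (!h.isSubtype(t, cand)) continue outer;
                return cand;
            }
            return "java.lang.Object"; // fallback: too-general type plus casts
        }
    }

For d in Figure 1, the constraint set {Circle, Rectangle, Drawable} yields Drawable, matching Dava's output in Figure 2(d).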
2.2 Expressions and Simple Statements
From our example we can also see that javac uses a very simple code generation strategy. Basically, each simple statement in Java is compiled to a series of bytecode instructions, under the assumption that the Java evaluation stack is empty before the statement executes and empty after the statement executes. For example, consider the bytecode generated for statement 8 (see the line with // 8 in Figure 1(a) and the bytecode generated at the directive .line 8 in Figure 1(b)). In this case the object reference stored in local 2 is pushed on the stack, the isFat method is invoked, which pops the object reference and pushes isFat's return value, and finally the return value is popped from the stack and stored in local 4. The expression stack had height 0 at the beginning of the statement and height 0 at the end of the statement.

This straightforward code generation strategy makes it fairly simple for a decompiler to rebuild the statement. However, many other bytecode sequences could express the same computations. Consider the example in Figure 3. Figure 3(a) gives the original bytecode as produced by javac, whereas Figure 3(b) gives an optimized version of the bytecode. The optimized version uses 5 fewer instructions and 3 fewer locals.2 An example of a simple optimization is found at line 7. At this point the second iload_0 instruction has been replaced with a dup instruction. A more complex optimization makes use of the expression stack to save values. For example, rather than storing the result of line 7 and then reloading it at line 8, the value is just left on the stack. Furthermore, since this same value is needed later, it is duplicated (the third dup at line 7). Line 8 demonstrates that the return value from the call to isFat can just be left on the stack. The swap instruction at line 8 exchanges the boolean value on top of the stack with the object reference just below it. Line 9 stores the object reference from the top of the stack, and line 12 uses the boolean value that is now on top of the stack for the ifne test.

When the optimized code from Figure 3(b) is given to the other decompilers, they all fail because the bytecode does not correspond to patterns they expect (see Figure 4, page 118). Jasmine and Jad emit error messages saying that the control flow analysis fails and emit code that is clearly not Java. Wingdis emits code that resembles Java but is clearly not correct, as the calls to the method isFat have been completely missed and the type of the left operand of == is an object rather than a boolean. SourceAgain also produces something that looks like Java, but it is also incorrect since it allocates too many objects and has lost the boolean variable.

Our Dava decompiler produces exactly the same Java code as for the unoptimized class file, except for the names of the local variables. Figure 2(d) contains no variables starting with $, whereas in Figure 4(e) three variables do start with $. In our generated code we prefix variables with $ to indicate variables corresponding to stack locations in the bytecode. Dava is insensitive to the input bytecode because it is built on top of the Soot framework, which transforms the bytecode into an intermediate representation called Grimp [13,15]. Soot begins by reading bytecode and converting it to simple three-address statements (this intermediate form is called Jimple). When generating Jimple the stack locations become specially named variables. Soot then uses U-D webs to separate different variables that may share the same local offset in bytecode, and finally performs simple code cleanup and the typing algorithm.
2 It should be noted that this is not a contrived example; it merely illustrates the problems we encountered when applying other decompilers to bytecode produced by Java bytecode optimizers (even very simple peephole optimizers) and to bytecode produced by compilers for other languages.
.method public static f(S)V
.limit stack 4
.limit locals 5
.line 6
    iload_0
    bipush 10
    if_icmple Label1
.line 7
    new Rectangle
    dup
    iload_0
    iload_0
    invokenonvirtual Rectangle/<init>(SS)V
    astore_2
.line 8
    aload_2
    invokevirtual Rectangle/isFat()Z
    istore 4
.line 9
    aload_2
    astore_3
    goto Label2
.line 12
Label1:
    new Circle
    dup
    iload_0
    invokenonvirtual Circle/<init>(I)V
    astore_1
.line 13
    aload_1
    invokevirtual Circle/isFat()Z
    istore 4
.line 14
    aload_1
    astore_3
.line 16
Label2:
    iload 4
    ifne Label3
    aload_3
    invokeinterface Drawable/draw()V 1
.line 17
Label3:
    return
.end method

(a) original bytecode

.method public static f(S)V
.limit stack 4
.limit locals 2
.line 6
    iload_0
    bipush 10
    if_icmple Label1
.line 7
    new Rectangle
    dup
    iload_0
    dup
    invokenonvirtual Rectangle/<init>(SS)V
    dup
.line 8
    invokevirtual Rectangle/isFat()Z
    swap
.line 9
    astore_1
    goto Label2
.line 12
Label1:
    new Circle
    dup
    iload_0
    invokenonvirtual Circle/<init>(I)V
    dup
.line 13
    invokevirtual Circle/isFat()Z
    swap
.line 14
    astore_1
.line 16
Label2:
    ifne Label3
    aload_1
    invokeinterface Drawable/draw()V 1
.line 17
Label3:
    return
.end method

(b) optimized bytecode
Fig. 3. Original bytecode as generated by javac and optimized bytecode

Given the typed Jimple, an aggregation step rebuilds expressions and produces Grimp. Grimp is the starting point for our restructuring algorithms described in the next section.
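To make this pipeline concrete, here is a schematic of how statement 8 of Figure 1(a) moves through the representations. The textual forms are simplified from Soot's actual syntax and are our own illustration:

    // bytecode:  aload_2; invokevirtual Rectangle/isFat()Z; istore 4
    // Jimple (three-address, typed; stack slots become named locals):
    //     z0 = virtualinvoke r2.<Rectangle: boolean isFat()>()
    // Grimp (after aggregation; expressions rebuilt for readability):
    //     z0 = r2.isFat()

Each stack location that survives into Jimple is a candidate for the $-prefixed variables seen in Dava's output.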
3 Control Flow Overview
The last major phase of our decompiler recovers a structured representation of a method's control flow. There may be more than one structured representation for any given control flow graph (CFG), so in Dava we focused on producing a correct restructuring that would be easy to understand. Other goals, such as fast restructuring or representing control flow with a restricted set of control flow statements, are possible but were not explored in Dava. For correctness, we used a graph-theoretic approach and focused on the capabilities of the Java grammar. For us, the key question was: "For any given set of control flow features in the CFG, can we represent it with pure Java?" When answering this question we must consider the following:
public static void f(short s)
{
    Object object;
    if (s <= 10) goto 24 else 6;
    expression new Rectangle dup 1 over 0
    expression s dup 1 over 0
    invoke Rectangle.<init> dup 1 over 0
    invoke isFat
    swap
    pop object
    expression new Circle(s) dup 1 over 0
    invoke isFat
    swap
    pop object
    if != goto 47
    object.draw();
}

(a) Jasmine

public static void f(short short0)
{
    if ((((byte)short0) <= 10)?
        (Circle circle1= new Circle(short0)):
        (Rectangle rectan1= new Rectangle(((short)short0), ((short)short0)))
        == false)
    {
        Drawable.draw();
    }
}

(b) Wingdis

public static void f(short word0)
{
    Rectangle rectangle;
    if(word0 <= 10)
        break MISSING_BLOCK_LABEL_24;
    rectangle = new Rectangle(word0, word0);
    rectangle.isFat();
    Object obj;
    obj = rectangle;
    break MISSING_BLOCK_LABEL_38;
    Circle circle = new Circle(word0);
    circle.isFat();
    obj = circle;
    JVM INSTR ifne 47;
       goto _L1 _L2
_L1:
    break MISSING_BLOCK_LABEL_41;
_L2:
    break MISSING_BLOCK_LABEL_47;
    ((Drawable) (obj)).draw();
}

(c) Jad

public static void f(short si)
{
    Object obj;
    Object tobj;
    Object tobj1;
    if( si > 10 )
    {
        Object tobj2;
        tobj = new Rectangle( si, si );
        tobj2 = ((Rectangle) tobj).isFat();
        obj = new Rectangle( si, si );
    }
    else
    {
        tobj = new Circle( si );
        tobj1 = ((Circle) tobj).isFat();
        obj = new Circle( si );
    }
    if( tobj1 == 0 )
        ((Drawable) obj).draw();
}

(d) SourceAgain

public static void f(short s0)
{
    boolean $z0;
    Drawable r0;
    Rectangle $r1;
    Circle $r2;
    if (s0 <= 10)
    {
        $r2 = new Circle(s0);
        $z0 = $r2.isFat();
        r0 = $r2;
    }
    else
    {
        $r1 = new Rectangle(s0, s0);
        $z0 = $r1.isFat();
        r0 = $r1;
    }
    if ($z0 == false)
        r0.draw();
    return;
}

(e) Dava
Fig. 4. Decompiled code for optimized method f

1. Every control flow statement in Java has exactly one entry point, and one or more exit points.
2. Java provides labeled blocks, labeled control flow statements, and labeled breaks and continues. With these, it is possible to represent any CFG that forms a directed acyclic graph (DAG) in pure Java. Consider the following.
We can topologically sort the statements from the bytecode representation of such a DAG and place a labeled block around the first node. We now represent any control flow from the first node to the second as a labeled break out of our newly created labeled block. Next, we place a labeled block around the first two statements, and represent any control flow going to the third statement as labeled breaks out of the second block. Similarly, we can place a labeled block around the first three statements, and so on. Although this will produce an ugly restructuring, it illustrates that it is possible to restructure any control flow DAG (see the sketch at the end of this section).
3. The representation of a strongly connected component in the CFG must include at least one Java language loop. There is, then, no direct representation for strongly connected components with two or more entry points, since there is no control flow statement in the grammar that supports more than one entry point. If such a strongly connected component is found, it must somehow be transformed to a semantically equivalent strongly connected component with only a single entry point.
4. The Java language provides exception handling with try, catch, and finally statements. Unfortunately, the Java bytecode exception handling mechanism is more flexible than these statements, and may produce control flow that is not directly expressible in the Java language.
5. The Java language provides object locking with synchronized statements. As with exception handling, the object locking mechanism in the Java bytecode specification is more flexible than the specification of the synchronized statement, and may produce lockings in the bytecode that are not directly expressible in the Java language.

For readability, we felt that a terse representation of control flow would be easier to understand than a diffuse one. In Dava, we pursue this secondary goal by building Java language statements that each represent as many of the CFG features as possible, with the intention of minimizing the number of statements produced altogether. Although not necessarily an optimal solution, this has, in practice, yielded excellent results.
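The following small Java method (our own illustration, with hypothetical statements s1, s2, s3) shows the labeled-block encoding from item 2 for a three-statement DAG in which s1 may branch directly to s3, skipping s2:

    class DagEncoding {
        static boolean cond;                 // stands in for an arbitrary test
        static void s1() {} static void s2() {} static void s3() {}

        static void encoded() {
            block2: {
                block1: {
                    s1();
                    if (cond) break block1;  // edge s1 -> s2: exit inner block
                    break block2;            // edge s1 -> s3: skip s2 entirely
                }
                s2();                        // reached only via break block1
            }
            s3();                            // every path ends here
        }
    }

Applying the construction statement by statement in topological order yields such nested blocks for any DAG, at the cost of the deep nesting the text warns about.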
3.1 A Brief Introduction to SET Restructuring
The restructuring phase of Dava uses three intermediate representations to perform its function: 1) Grimp, a list of typed, unstructured program statements, which loosely corresponds to the method’s bytecode instruction stream, 2) a CFG representing the control flow from the Grimp representation, and 3) a Structure Encapsulation Tree (SET)[9]. The Grimp representation is fed to the restructurer, which produces the CFG and the SET. The finished SET is very similar to an abstract syntax tree, and the final Java language output is obtained simply by traversing it. The CFG is built by finding all the potential successors to each Grimp statement. All branches in Java bytecode are direct, so this is a straightforward task.
The only novel feature of this CFG is that it distinguishes edges representing normal control flow from those representing the throwing of an exception. The SET is built in 6 phases. A more complete description can be found in our paper entitled "Decompiling Java Using Staged Encapsulation" [9]; here we provide a brief overview. Each phase searches for a specific type of feature in the CFG and produces structured Java language statements that can represent that feature. The Java statement is then bundled with the set of nodes (wrapped Grimp statements) from the CFG that would correspond to its body. Since every structured Java statement has only one entry point, we can usually use dominance to determine the body. For example, a while statement would consist of the appropriate condition expression plus those statements from the CFG that the condition dominates, minus those statements reachable by the control flow from the condition that escapes the loop. The structured bundle is then nested in the SET such that the set of statements in the bundle is a subset of those in its parent node and a superset of those in its children nodes. In this way the SET can be built up in any arbitrary order of node insertion. Note also that the properties searched for in the CFG (i.e., dominance and reachability) are transitive, which guarantees that the superset/subset relations between SET bundles and their children will always hold.
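As a rough sketch of the dominance-based body computation just described, consider the following. Cfg and Node and their methods are hypothetical stand-ins, not Dava's actual API, and the real computation must also handle paths that re-enter the loop through its header:

    import java.util.HashSet;
    import java.util.Set;

    interface Node {}
    interface Cfg {
        Set<Node> dominated(Node n);             // nodes dominated by n
        Set<Node> reachableFrom(Set<Node> from); // forward reachability
    }

    final class WhileBody {
        // Body = statements the condition dominates, minus statements
        // reachable once control has escaped through a loop-exit edge.
        static Set<Node> of(Cfg cfg, Node condition, Set<Node> exitTargets) {
            Set<Node> body = new HashSet<>(cfg.dominated(condition));
            body.removeAll(cfg.reachableFrom(exitTargets));
            return body;
        }
    }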
4 Basic Control Flow Constructs
A decompiler must be able to find if, switch, while, and do-while statements, labeled blocks, and labeled breaks and continues. Many decompilers use reduction-based restructuring. These restructurers work by searching the CFG for local patterns that directly correspond to those produced by Java grammar productions. When a pattern is found it is reduced to a single node in the CFG and the search is repeated. This process is iterated until no more reductions can be found. In general this approach is difficult because the library of patterns that are matched against does not cover all possible patterns in the CFG. At some point, one may not find any more reductions, but still not have reduced the program to a single structured statement. In contrast, Dava searches for features in the control flow graph in order of how flexibly they can be treated. For example, strongly connected components must be represented by loops, which is an inflexible requirement. Accordingly, the conditions of loops are found before the conditions of if statements.

4.1 Loops
The most general way to characterize cyclic behavior in the CFG is to begin by searching for the strongly connected components (SCCs). For each SCC, we build a Java loop. By examining the properties of the entry and exit points in the SCC we can determine which type of Java loop (while, do-while or while(true)) is suitable for the structured representation. Once we know the type of loop, we know which statement in the CFG yields the conditional expression (if any) for the structured loop, and we can find the loop body. We know that for every iteration of a Java loop, if the loop is conditional, the condition expression must be evaluated, or if the loop is unconditional, the entry point statement must be executed. To find nested loops, we simply remove the condition statement, or the entry point statement, from the CFG and re-evaluate to see if any SCCs remain. This process is iterated until no more SCCs are found.

This process seems to be more robust than reduction-based techniques. Consider the small, if somewhat contrived, example in Figure 5, page 122. Method foo() has no real purpose other than to illustrate the performance of a restructurer on difficult, loop-based control flow. The original Java source was compiled with javac and the resulting bytecode class was not modified in any way. This example has two interesting components: (1) the outer loop only executes if an exception is thrown, and (2) if the inner loop exits normally, the next statement that affects program state is the return. We can see that only Dava produces correct, recompilable code, though it does not greatly resemble the original program. Jad alone produces code that is reminiscent of the original, but unfortunately it is neither correct nor recompilable.

We may also encounter multi-entry point SCCs. Here the input does not directly correspond to a structured Java program, so all decompilers will output ugly Java code. There are several solutions, but all involve transforming the CFG. Our solution converts the multi-entry point SCC to a single entry point SCC by breaking the control flow to the original entry points and rerouting it to a dispatch statement. This dispatch then acts as the single entry point and redirects control to the appropriate destination.
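One possible shape of such a dispatch, as our own illustration of the idea rather than Dava's literal output, for an SCC over two blocks a and b with external entries into both:

    class Dispatch {
        static void a() {} static void b() {}
        static boolean enteredAtA, again;

        static void run() {
            int entry = enteredAtA ? 1 : 2;   // single entry: the dispatch
            while (true) {
                if (entry != 2) a();          // original entry point 1
                entry = 1;                    // later iterations take normal flow
                b();                          // original entry point 2
                if (!again) break;
            }
        }
    }

The dispatch variable entry simulates the two original entry points while the loop itself has only one.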
4.2 Labeled Statements, Blocks, and break and continue Statements
As shown in Section 3, page 117, labeled blocks can resolve any difficulties in restructuring control flow DAGs. In Dava, once we have found all the nodes for the SET from the CFG, we determine whether any of the control flow necessitates the introduction of labeled statements, labeled blocks, breaks or continues. Once this phase is done, we have fully restructured our target program. One might expect control flow necessitating the use of these statements to present difficulties to pattern-based decompilers since (1) the code produced by these statements is not fully structured, and (2) human programmers rarely exercise these features. It seems, however, that much work has been done on this problem, as several other decompilers, notably Jad and SourceAgain, deal well with producing labeled statements, blocks, breaks, and continues.
public int foo( int i, int j)
{
    while (true) {
        try {
            while (i < j)
                i = j++/i;
        } catch (RuntimeException re) {
            i = 10;
            continue;
        }
        break;
    }
    return j;
}

(a) Original Java Source

public int foo(int i, int j)
{
    while(true)
        try
        {
            while(i < j)
                i = j++ / i;
            break MISSING_BLOCK_LABEL_25;
        }
        catch(RuntimeException runtimeexception)
        {
            i = 10;
        }
    return j;
}

(b) Jad

public int foo(int i, int j)
{
    RuntimeException e;
    for (i = j++ / i; i < j; i = j++ / i)
        /* null body */ ;
    return j;
    pop e
    i = 10;
}

(c) Jasmine

public int foo(int i, int j)
{
    while( i < j )
        i = j++ / i;
    return j;
}

(d) SourceAgain

public int foo(int int1, int int2)
{ // WingDis cannot analyze control flow
  // of this method fully
B0:
    goto B3;
B1:
    try {
        goto B3;
B2:
        int1= int2++ / int1;
B3:
        if (int1 < int2) goto B2;
    }
B4:
    goto B8;
B5:
    catch (RuntimeException null) {
B6:
        int1= 10;
B7:
        goto B3;
    }
}

(e) Wingdis

public int foo(int i0, int i1)
{
    int $i2;
    while (true) {
        try {
            if (i0 < i1) {
                $i2 = i1;
                i1 = i1 + 1;
                i0 = $i2 / i0;
                continue;
            }
        } catch (RuntimeException $r2) {
            i0 = 10;
            continue;
        }
        return i1;
    }
}

(f) Dava
Fig. 5. Decompiled code for method foo()
5 Exceptions and Synchronized Blocks
Java bytecode and the Java language treat exception handling in very different ways. Bytecode is simply a numbered sequence of virtual machine instructions. Here, exception handling is specified by a table, where each entry holds a starting instruction number, a finishing instruction number, a reference to an exception class, and a pointer to a handler instruction. If an exception is thrown, the virtual machine runs through the table checking to see if the current instruction is in the instruction range given by any of the table entries. If it is in range, and
the thrown exception matches the table entry's exception class, then control is transferred to that entry's handler instruction.

In bytecode, regular control flow imposes few restrictions on exception handling. Control flow may enter or exit at any instruction within a table entry's area of protection, and does not have to remain constantly within that area once it enters. Multiple control flow paths may enter a single area of protection at different points, and different areas of protection may overlap arbitrarily. The handler instruction may be anywhere within the class file, limited by the constraints of bytecode verification, including within the table entry's own area of protection. Finally, more than one exception table entry may share the same exception handler. In short, exception handling in Java bytecode is mostly unstructured.

By contrast, exception handling in the Java language uses the try, catch and finally grammar productions and is highly structured. There is only one entry point to a try statement, control flow within it is contiguous, and each of these Java statements nests properly. There is no way to make try statements partially overlap each other. Also, each try must be immediately followed by a catch and/or a finally statement. There may be any number of catch statements but no more than one finally. If an exception is thrown and is not caught in a catch statement, then the method in which this occurs must declare that it throws that exception. Method declarations must agree between subclasses and superclasses. Therefore, if some method m1 declares a throws and overrides or is overridden by another method m2, then m2 must also declare the throws.

There is a complication to the throws declaration rule. Object locking is provided in Java with the synchronized() statement. If a thrown exception causes control to leave a synchronized() statement, the Java language specification requires that the object lock be released. This is accomplished in the bytecode by catching the exception, releasing the lock in the exception handler and finally rethrowing the exception. This exception handling should not be translated into try-catch statements, but remains masked by the synchronized() statement. Consequently, throws that are implied by a synchronized() statement's exception handling are not explicitly put in the Java language representation, and therefore are also ignored in the method declaration.

There are numerous consequences of this "semantic gap" in exception handling. An area of protection must be represented by a try statement, and handlers by a catch or finally. However, a try statement has only one entry point, so an area of protection with more than one entry point must be split into as many parts as there are entry points. Each of these new areas of protection shares the same handler, but a catch statement can only be immediately preceded by a single try. To reconcile this, the handler statement (at least) must be duplicated for each area of protection. If two areas of protection overlap but neither fully encapsulates the other, we must break up at least one of the areas to allow the resulting try statements to either be disjoint or nest properly.
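A schematic of the table-driven handler search described above (simplified; the class and its field and method names are our own, not the JVM's internal ones):

    final class HandlerSearch {
        static final class Entry {
            int startPc, endPc, handlerPc;
            Class<?> catchType;          // null stands in for "any exception"
        }

        // Returns the handler pc of the first matching entry, or -1 to
        // signal that the exception propagates to the caller.
        static int find(Entry[] table, int pc, Throwable thrown) {
            for (Entry e : table) {
                boolean inRange = e.startPc <= pc && pc < e.endPc;
                if (inRange && (e.catchType == null || e.catchType.isInstance(thrown)))
                    return e.handlerPc;  // transfer control to the handler
            }
            return -1;
        }
    }

Nothing in this lookup requires the protected ranges to nest or the handlers to sit outside them, which is exactly the freedom the Java language's try statement lacks.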
[(a) Original control flow graph: straight-line statements a, b, c, d, f with handlers e and g; solid edges denote normal control flow, dashed edges exceptional control flow.]

public void foo()
{
    System.out.println("a");
    label_0:
    {
        try
        {
            System.out.println("b");
        }
        catch (RuntimeException $r9)
        {
            System.out.println("g");
            break label_0;
        }
        try
        {
            System.out.println("c");
        }
        catch (RuntimeException $r9)
        {
            System.out.println("g");
            break label_0;
        }
        catch (Exception $r5)
        {
            System.out.println("e");
            break label_0;
        }
        try
        {
            System.out.println("d");
        }
        catch (Exception $r5)
        {
            System.out.println("e");
            break label_0;
        }
    }
    System.out.println("f");
    return;
}

(b) Dava

public void foo()
{
    System.out.println("a");
    System.out.println("b");
    try
    {
        System.out.println("c");
        System.out.println("d");
    }
    // Misplaced declaration of
    // an exception variable
    catch(D this)
    {
        System.out.println("e");
    }
    System.out.println("g");
    return;
    this;
    System.out.println("f");
    return;
}

(c) Jad

public void foo()
{
    System.out.println("a");
    System.out.println("b");
    System.out.println("c");
    System.out.println("d");
    pop this
    System.out.println("e");
    System.out.println("f");
    return;
    pop this
    System.out.println("g");
}

(d) Jasmine

public void foo()
{
    System.out.println("a");
    try
    {
        System.out.println("b");
        try
        {
            System.out.println("c");
            System.out.println("d");
        }
        catch (Exception e0)
        {
            System.out.println("e");
        }
    }
    catch (RuntimeException e0)
    {
        System.out.println("g");
    }
}

(e) Wingdis

public void foo()
{
    System.out.println( "a" );
    label_9:
    {
        try
        {
            System.out.println( "b" );
            try
            {
                System.out.println( "c" );
                break label_9;
            }
            catch( Exception exception1 )
            {
                System.out.println( "e" );
            }
        }
        catch( RuntimeException runtimeexception1 )
        {
            System.out.println( "g" );
        }
        System.out.println( "f" );
        return;
    }
    System.out.println( "d" );
}

(f) SourceAgain
Fig. 6. Decompiled code for method foo()
Although these problems do not normally appear in bytecode generated by javac, they may still arise in perfectly valid Java bytecode. Consider the example control flow graph in Figure 6(a), page 124. Here, we created a class file by hand that has a straight line of statements a b c d f with two areas of protection. If a RuntimeException is thrown in area of protection [b c], control flow is directed to g. If, however, an Exception is thrown in area of protection [c d], control flow is directed to e. We cannot simply represent the two areas as two try statements because they will not nest each other properly. The correct solution to this problem is to break the two areas of protection into three try statements, and to split and aggregate their handlers into appropriate catch statements, as shown in the output from Dava in Figure 6(b).

Again, other decompilers seem to rely on the bytecode reflecting an already structured program, and produce incorrect output. For example, Wingdis' output in Figure 6(e) looks close to a correct solution. However, besides omitting statement f, the chief problem is that statement d has been placed in two areas of protection, which violates the semantics of the original control flow graph. The output program does operate correctly, but only because the illegal RuntimeException exception handler is masked off by the correct Exception exception handler. Since this masking only occurs because RuntimeException happens to be a subclass of Exception, it is not likely part of a correct general approach.

Object locking with synchronized() statements poses even greater problems. Java bytecode provides locking with monitorenter and monitorexit instructions. The Java virtual machine specification only states that for any control flow path within a method, the number of monitorexits performed on some object should equal the number of monitorenters. The precise conditions for representing the locked object's "critical section" with synchronized() statements may not exist within the target program, or, equally likely, multiple "critical sections" may intersect without either nesting the other. These problems cannot be represented with synchronized() statements. Luckily, it is possible to build an implementation of monitors in pure Java and to replace the monitor instructions with static method calls to this implementation; a sketch of such an implementation is given at the end of this section. As well as providing a solution for "unrepresentable" situations, this fallback mechanism gives the decompiler writer a choice about how aggressively to try to build synchronized() statements. At the most aggressive extreme, one might try to transform the control flow graph so as to maximize the representation of object locking with synchronized() statements, using the fallback mechanism only where provably necessary. At the other extreme, one might always use the fallback mechanism.

We began in Dava by trying to make the most aggressive synchronized() statement restructurer possible. Through testing, however, we found that the most important issue for synchronized() restructuring is good exception handling. Since the set of features necessary in the bytecode to produce
126
Jerome Miecznikowski and Laurie Hendren
synchronized() blocks is both complex and specific, it turns out that the occurrence of the proper feature set is almost always the result of a synchronized() block in the bytecode’s source. As such, it is already in a form that is easily restructured and an aggressive approach provides little improvement over simple pattern matching.
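Here is a minimal sketch of the pure-Java monitor fallback described above. All names are our own and Dava's actual replacement class may differ; entries are never reclaimed in this sketch, and real code would also preserve the verifier's structured-locking guarantees:

    import java.util.IdentityHashMap;
    import java.util.Map;

    final class Monitor {
        private static final class State { Thread owner; int count; }
        private static final Map<Object, State> states = new IdentityHashMap<>();

        private static State stateOf(Object o) {
            synchronized (states) {
                State s = states.get(o);
                if (s == null) { s = new State(); states.put(o, s); }
                return s;
            }
        }

        // Replaces a monitorenter instruction.
        static void enter(Object o) {
            State s = stateOf(o);
            synchronized (s) {
                Thread me = Thread.currentThread();
                while (s.owner != null && s.owner != me) {
                    try { s.wait(); } catch (InterruptedException ie) { /* retry */ }
                }
                s.owner = me;
                s.count++;                    // monitors are re-entrant
            }
        }

        // Replaces a monitorexit instruction.
        static void exit(Object o) {
            State s = stateOf(o);
            synchronized (s) {
                if (--s.count == 0) { s.owner = null; s.notify(); }
            }
        }
    }

Because enter and exit are ordinary static calls, they can be placed at arbitrary points in the reconstructed control flow, exactly where the original monitorenter and monitorexit instructions sat.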
6 Related Work and Conclusions
To our knowledge there are few papers on the complete problem of decompiling arbitrary bytecode to Java. There are many tools, including the decompilers we tested in this paper, but there is very little written about the design and implementation of those tools. The implementation of the Krakatoa decompiler has been described in the research literature [11]; however, we were unable to test this decompiler because it is not publicly available. Krakatoa uses an extended version of Ramshaw's goto-elimination technique [12], which produces legal, though somewhat convoluted, Java structures by introducing loops and multi-level breaks. Krakatoa then applies a series of rewrite rules to this structured representation, where each rule attempts to replace a program substructure with a more "natural" one. Such a relatively strong restructurer may be able to handle complicated loops. While it is not clear from the paper how the typing and expression building work, Krakatoa appears to use the same approach as the decompilers we tested. All program examples come from bytecode generated by javac. This approach does not address the problems with exceptions and synchronization.

There has been related work on restructuring Java and other high-level languages. Research on restructuring can usually be divided into restructuring with gotos versus eliminating gotos. The independent works of Baker [2] and Cifuentes [3] are prominent examples of the first category, while Erosa [4] and Ammarguellat [1] are good examples of the second. These are general approaches and would require modifications to deal with the special requirements of Java, such as dealing with synchronization and exceptions. Knoblock and Rehof [8] have worked on finding static types for Java programs. Their approach differs from ours in that it works on an SSA intermediate representation and may change the type hierarchy when types conflict due to interfaces.

This paper has presented some of the problems, traps and pitfalls encountered when decompiling arbitrary, verifiable Java bytecode. We demonstrated the problems in dealing with variables, literals and types, and showed how existing decompilers deal with the typing problem by inserting spurious type casts (or by producing incorrect code). We showed that bytecode that has been optimized is not correctly decompiled by any of the four decompilers we tested. This demonstrates that such decompilers target bytecode that has been produced by a known compilation strategy, such as that used by javac. We discussed the overall problem of control flow structuring and showed that even control flow produced by javac can be difficult to handle. Finally, we demonstrated that
bytecode allows for more general use of exceptions and synchronization than what is produced from Java. In all cases our Dava decompiler was able to produce a correct Java program.

Now that we have a robust decompiler, we will begin to concentrate on a postprocessor that converts control flow constructs into idioms likely to be used by a programmer, and on mechanisms for choosing readable variable names for parameters and local variables. We will also continue to stress test the decompiler by decompiling class files from a variety of sources. The decompiler will be released as part of the Soot framework, and will be publicly available. Currently, interested parties can contact the first author for a "preview version" of the software.
References

1. Z. Ammarguellat. A control-flow normalization algorithm and its complexity. IEEE Transactions on Software Engineering, 18(3):237–250, March 1992.
2. B. S. Baker. An algorithm for structuring flowgraphs. Journal of the Association for Computing Machinery, pages 98–120, January 1977.
3. C. Cifuentes. Reverse Compilation Techniques. PhD thesis, Queensland University of Technology, July 1994.
4. A. M. Erosa and L. J. Hendren. Taming control flow: A structured approach to eliminating goto statements. In Proceedings of the 1994 International Conference on Computer Languages, pages 229–240, May 1994.
5. E. M. Gagnon, L. J. Hendren, and G. Marceau. Efficient inference of static types for Java bytecode. In Static Analysis Symposium 2000, Lecture Notes in Computer Science, pages 199–219, Santa Barbara, June 2000.
6. Jad - the fast JAva Decompiler. http://www.geocities.com/SiliconValley/Bridge/8617/jad.html.
7. SourceTec Java Decompiler. http://www.srctec.com/decompiler/.
8. T. Knoblock and J. Rehof. Type elaboration and subtype completion for Java bytecode. In Proceedings of the 27th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2000.
9. J. Miecznikowski and L. Hendren. Decompiling Java using staged encapsulation. In Proceedings of the Working Conference on Reverse Engineering, pages 368–374, October 2001.
10. Mocha, the Java Decompiler. http://www.brouhaha.com/~eric/computers/mocha.html.
11. T. A. Proebsting and S. A. Watterson. Krakatoa: Decompilation in Java (Does bytecode reveal source?). In 3rd USENIX Conference on Object-Oriented Technologies and Systems (COOTS'97), pages 185–197, June 1997.
12. L. Ramshaw. Eliminating go to's while preserving program structure. Journal of the Association for Computing Machinery, 35(4):893–920, October 1988.
13. Soot - a Java Optimization Framework. http://www.sable.mcgill.ca/soot/.
14. Source Again - A Java Decompiler. http://www.ahpah.com/.
15. R. Vallée-Rai, E. Gagnon, L. Hendren, P. Lam, P. Pominville, and V. Sundaresan. Optimizing Java bytecode using the Soot framework: Is it feasible? In D. A. Watt, editor, Compiler Construction, 9th International Conference, volume 1781 of Lecture Notes in Computer Science, pages 18–34, Berlin, Germany, March 2000. Springer.
16. WingDis - A Java Decompiler. http://www.wingsoft.com/wingdis.html.
Forwarding in Attribute Grammars for Modular Language Design

Eric Van Wyk1, Oege de Moor1, Kevin Backhouse1, and Paul Kwiatkowski2

1 Oxford University Computing Laboratory
2 Microsoft Corporation
Abstract. Forwarding is a technique for providing default attribute definitions in attribute grammars that is helpful in the modular implementation of programming languages. It complements existing techniques such as default copy rules. This paper introduces forwarding, and shows how it is but a small extension of standard higher-order attribute grammars. The usual tools for manipulating higher-order attribute grammars, including the circularity check (which tests for cyclic dependencies between attribute values), carry over without modification. The closure test (which checks that each attribute has a defining equation) needs modification, however, because the resulting higher-order attribute grammars may contain spurious attributes that are never evaluated, and indeed that need not be defined.
1 Motivation
The modular definition of programming languages is a long-standing problem, and a lot of work has been devoted to its solution in the context of attribute grammars, e.g. [1,3,8,10,11,13,14,19,17,21,25,29,32]. Some of these proposals take inspiration from the object-oriented paradigm, advocating the use of inheritance to achieve modularisation. Others take inspiration from functional programming, by employing higher-order functions to achieve a separation of concerns. The present paper is a modest contribution towards these developments, showing how a certain form of inheritance called forwarding can be achieved in higher-order attribute grammars. In our view, forwarding is the main innovative idea in the design of the Intentional Programming system [27,30]. That system, until recently under development at Microsoft, is an environment for interactive language design, similar in spirit to many of the above attribute grammar systems.

The structure of the paper is as follows. First we present a number of motivating examples that introduce forwarding, along with the idea of production-valued attributes. Next, we show how a complete grammar that was composed using forwarding can be expanded to an ordinary higher-order attribute grammar. Finally, we demonstrate that the standard circularity test for higher-order attribute grammars can be applied to such modular descriptions. The standard closure test does however need some modification, and we argue that the desired effect can be achieved through an appropriate implementation of the circularity test. That test might thus be more appropriately named the definedness test.
1.1 A Forwarding Example: Record Invariants
Consider a programming language with record types, and the usual with construct for concisely referring to the field names. Assume that we wish to add a new feature, namely that of record invariants, which state some invariant relationship between field values. At the end of each with clause, it is checked that the invariant is actually satisfied. To keep the example simple, we avoid the problem of name capture by prohibiting the invariant from referring to global variables. As we discuss in our technical report [30], the more general problem is easily solved by correctly maintaining the environment. Here is an example of a program that uses record invariants:

let rec_type = record { f1 :: int, f2 :: int }
               invariant f1 ≡ 2 ∗ f2
    r :: rec_type
in
    with r begin f1 := 4; f2 := 8 end

This program fragment in the augmented language is equivalent to the following fragment in the base language:

let rec_type = record { f1 :: int, f2 :: int }
    r :: rec_type
in
    with r begin
        f1 := 4; f2 := 8;
        if ¬f1 ≡ 2 ∗ f2 then error "invariant fails"
    end

Our challenge is to implement this extension as a small, modular addition to the base language definition. Of course this notion of language extension, where the new feature is rewritten to existing idioms, is extremely common. It is the basis of the idea that it suffices to define a small, elegant core language on which richer features are then built. Let us call the grammar production for the new with construct with'. Essentially we would like to define its semantics through the rewrite rule

with' r ss ⇒ with r (ss ++ if ¬r.type.invariant then error "invariant fails")

In particular, we do not wish to define each of the attributes for with' anew: that would require detailed knowledge of all semantic aspects of the base language. Note that on the right-hand side of this rewrite, we are referring to the invariant of the type of r. This is a new piece of information that has to be added to every record type. Note that the invariant is in fact represented as a syntax tree,
so here we have an example of a higher-order attribute. Defining new language features in this way, by expanding new productions to old, is called forwarding.

Forwarding is similar to, but subtly different from, syntax macros [33] and semantic macros [20]. Like semantic macros, forwarding does give access to semantic information, for instance the attribute r.type.invariant. Such semantic information is not available in syntax macros. Forwarding is different from both semantic and syntax macros in that not all attribute queries on a new production such as with' are forwarded to the expanded form. For example, an attribute that defines a pretty-printing of the original program would be defined for with' directly. If we relied on forwarding to define the pretty-printing, it would show the expanded form, which is clearly undesirable. A typical use of forwarding thus states the expansion into primitive terms as a rewrite rule, but it also defines a number of attributes whose values are specific to the original higher-level construct. This also sets attribute-grammars-with-forwarding apart from language processors that are mainly based on reflection, such as MetaML [26] or 'C [9].

Forwarding is very close to higher-order attribute grammars. As we shall see shortly, the only substantial difference is that here the "copy rules" for all relevant attributes are automatically generated. There are a number of minor differences, in particular that forwarding is commonly used in conjunction with production-valued attributes. This feature also highlights the difference between forwarding and object-oriented extensions of attribute grammars, since the forwarded-to construct is dynamically computed at attribute evaluation time instead of statically determined via inheritance [21] when the attribute grammar is defined. We now turn to an example to illustrate this phenomenon.

1.2 A Production-Valued Attribute Example: Operator Overloading
The aim is to overload the + operator for numeric addition and string concatenation. Furthermore, we would like to achieve this in a modular fashion: overloading + on yet another type (such as matrices) should not require any changes to existing attribute definitions. Below we shall present three versions of the solution: the first achieves the desired modularity, the second exemplifies how production-valued attributes can be compiled away, and the final version demonstrates the reduction to ordinary higher-order attribute grammars. Our starting point is a grammar with nonterminals Expr (for expressions) and Type (for types) that includes the following productions:

name     description             production
plus     overloaded +            Expr ::= Expr Expr
add      numeric addition        Expr ::= Expr Expr
cat      string concatenation    Expr ::= Expr Expr
num      numeric constant        Expr ::= Number
str      string constant         Expr ::= Qstring
id       identifiers             Expr ::= Id
int      type of integers        Type ::= ε
string   type of strings         Type ::= ε
The Expr nonterminal has an attribute code of type String and a higher-order attribute type which contains a tree derived from nonterminal Type. It also has an inherited attribute environment. Figure 1 sketches the attribute definitions that one would expect on those productions that have no direct relation to overloading of +. Note the use of type as a higher-order attribute. The environment attribute is defined implicitly by default copy. Below we shall discuss the attribute definitions for the Type productions, and for overloaded +.
add : Expr ::= Expr Expr
    Expr1.code = gen_add_code(Expr2.code, Expr3.code)
    Expr1.type = integer

concat : Expr ::= Expr Expr
    Expr1.code = gen_concat_code(Expr2.code, Expr3.code)
    Expr1.type = string

numeric_const : Expr ::= num
    Expr1.code = gen_num_code(num)
    Expr1.type = integer

string_const : Expr ::= str
    Expr1.code = gen_str_code(str)
    Expr1.type = string

identifier : Expr ::= id
    Expr1.code = gen_id_code(id)
    Expr1.type = lookup(Expr.environment, id.lexeme)
Fig. 1. Standard attribute definitions
Using forwarding and production-valued attributes. The productions for the types are ε-productions (empty right hand sides). We introduce a new attribute on types, called plusProd (short for "plus production"). The values of this attribute are tree constructors: they take two trees, and build a new tree. Furthermore these trees should be syntax trees that were derived from the Expr nonterminal, so the type of plusProd is Expr × Expr → Expr. Any production in the grammar that rewrites an expression to two further expressions could be viewed as a function of this type. For integers, plusProd is add — that is the production for numeric addition. For strings, plusProd is concat, which is the production for string concatenation. These are the formal definitions:

integer : Type ::= ε
    Type.plusProd = add
string : Type ::= ε
    Type.plusProd = concat
The presence of the plusProd attribute and forwarding makes it rather easy to define the overloaded plus attributes. The generic plus production forwards
to a node created by the appropriate production (here add or concat), which is retrieved from an attribute on a child. This production is provided with children (the same ones as the original plus) and provided with inherited attributes by "inserting" it into the current tree as the third implicit child of plus. This is a well-known use of higher-order attributes. What is new is that we are passing a production as an attribute and providing it with children. Furthermore, any synthesised attribute of the generic plus that is not explicitly defined is obtained by implicit copying from the newly created node. We have added a pretty-printing attribute to this example as an example of an explicitly defined attribute. Similarly, inherited attributes are implicitly passed to the forwarding node.

plus : Expr ::= Expr Expr Type
    Type = Expr2.type
    Expr1.pp = Expr2.pp ++ "+" ++ Expr3.pp
    forwardsTo Type.plusProd (Expr2, Expr3)

Note that we need not define the inherited attributes of Expr2 and Expr3 unless we use synthesised attributes that depend on them. For example, Expr3.type is not used, so there is no need to define Expr3.environment. Unfortunately, the standard closure test requires every inherited attribute to be defined regardless. This problem and its solution are discussed further in Section 3.3.

At the outset we stated that the overloading should be achieved in a modular fashion, so that adding a new overloading is just a local change to the attribute grammar. Indeed, this goal has been achieved. All we need to add for overloading + on matrices are the following two productions:

matrix : Type ::= ε
    Type.plusProd = matrix_add

matrix_add : Expr ::= Expr Expr
    Expr1.type = matrix
    Expr1.code = gen_matrix_add_code(Expr2.code, Expr3.code)

Some readers may argue that this form of overloading is rather awkward compared to the overloading features of a modern programming language such as Haskell [16]. That is certainly true: here we merely use the example of overloaded syntax to illustrate the merits of production-valued attributes and forwarding in a nutshell.

Elimination of production-valued attributes. We now aim to show how the device of production-valued attributes can be eliminated from this example. Naturally the productions of Figure 1 remain as before. The elimination of production-valued attributes is very similar to the elimination of higher-order functions from more general programs [12]. For each production-valued attribute attr which can be given the value of productions prod1, prod2, . . . , prodn, create an enumerated type attr_token whose possible values are the tokens attr_prod1, attr_prod2, . . . , attr_prodn. This is possible because there are a fixed number of productions in the grammar. We then
replace production-valued attributes with attributes whose values are these new enumerated types. We replace production references in attribute definitions with the appropriate token, and replace attribute references with case statements which switch on the token value to make use of the appropriate production, as below.

plus : Expr ::= Expr Expr Type
    Type = Expr2.type
    forwardsTo case Type.plusProd_token of
        plusProd_add    → add(Expr2, Expr3)
        plusProd_concat → concat(Expr2, Expr3)

integer : Type ::= ε
    Type.plusProd_token = plusProd_add

string : Type ::= ε
    Type.plusProd_token = plusProd_concat

This is still a fairly painless way to implement operator overloading. If new attributes, say for a new target language, are introduced onto the add and concat productions, we do not need to change the plus production, thanks to forwarding. We do lose modularity in another dimension, however: to overload matrix addition, we need to change the forwards-to clause of the plus production so that the case statement recognises a token for matrix addition.

Elimination of forwarding. We now consider how the above version of our example can be implemented without forwarding. The elimination of forwarding involves two changes. First, we need to introduce a new, explicit child of plus to represent the newly constructed tree that attribute queries are forwarded to. Second, all attribute definitions for plus have to be made explicit. The inherited attributes of the newly constructed tree are the inherited attributes of the generic plus. The synthesised attributes of the generic plus are the same as the synthesised attributes of the newly constructed tree. In summary, the plus production is transformed into the following:

plus : Expr ::= Expr Expr Type Expr
    Type = Expr2.type
    Expr4 = case Type.plusProd_token of
        plusProd_add    → add(Expr2, Expr3)
        plusProd_concat → concat(Expr2, Expr3)
    Expr1.code = Expr4.code
    Expr1.type = Expr4.type

No other productions need to be altered. It goes without saying that most of the modularity has been lost at this stage. In particular, the introduction of a new target language would necessitate a new attribute equation, and the introduction of a new overloading would require a new clause in the case expression.
2 Attribute Grammars with Forwarding

2.1 Definition of Attribute Grammars with Forwarding
An attribute grammar with forwarding is defined as a tuple $\langle G, A, S \rangle$ where $G$ is a context-free grammar, $A$ specifies attributes for nonterminals in $G$, and $S$ defines the attribute-defining semantic functions for each production in $G$. A context-free grammar $G$ is defined as a tuple $\langle N, T, P, S \rangle$ where $N$ is a finite set of nonterminal symbols, $T$ is a finite set of terminal symbols, $S \in N$ is a nonterminal called the start symbol, and $P \subset N \times (N \cup T)^*$ is a finite set of productions. Each $p \in P$ has the form

$X^p_0 ::= X^p_1 X^p_2 \ldots X^p_{m_p} X^p_{f_p}$

where $X^p_0 \in N$ is the left hand side of the production $p$; $X^p_i \in N \cup T$, $1 \le i \le m_p$, are the standard terminals and nonterminals on the right hand side of the production $p$; and $X^p_{f_p}$, $f_p = m_p + 1$, is a distinguished optional nonterminal called the forwards-to nonterminal. The left hand side nonterminal $X^p_0$ is the same nonterminal as the forwards-to nonterminal $X^p_{f_p}$, if it exists. If there is no forwarding nonterminal for $p$ then $f_p = m_p$.

Each nonterminal is attributed with semantic values called attributes. For a nonterminal $X \in N$, $A(X)$ is the set of attributes which are assigned values for $X$. This set is partitioned into synthesised, $A_s(X)$, and inherited, $A_i(X)$, attributes. The set of all attributes is $A = \bigcup_{X \in N} A(X)$. The type of an attribute $a \in A$ is specified by $A_t(a)$ and indicates the possible values that can be assigned to occurrences of $a$. A set of base types $T_b$ is left undefined but typically includes integers, strings, etc. In traditional attribute grammars as defined by Knuth [18], $A_t(a) \in T_b$. In higher order attribute grammars [29,32,31] attributes can also take on the value of syntax trees whose type is the terminal or nonterminal symbol at the root of the tree. Thus $A_t(a) \in T_b \cup N \cup T$.

In higher order attribute grammars, some of the nonterminals on the right hand side of the production are classified as nonterminal attributes. The abstract syntax trees rooted at these nonterminals are not created by parsing a source text, as the standard nonterminals are, but are generated by semantic rules associated with the production. We will require that all nonterminals classified as nonterminal attributes are to the right of all standard nonterminals on the right hand side of the production. Thus, we can define $nta_p$ as the index of the first nonterminal attribute of $p$, such that for every $i \ge nta_p$, $X^p_i$ is a nonterminal attribute, and for every $i < nta_p$, $X^p_i$ is a standard nonterminal. In particular, the forwards-to nonterminal is also a nonterminal attribute. If $p$ has no nonterminal attributes then $nta_p > m_p$. The signature of a production $p$, denoted $\sigma(p)$, is

$\sigma(p) = X^p_1 \times X^p_2 \times \ldots \times X^p_{nta_p - 1} \to X^p_0$,

that is, the right hand side nonterminals whose trees are not computed by semantic rules on $p$, together with the left hand side nonterminal. The set of all signatures for all productions is $\Sigma(P) = \bigcup_{p \in P} \sigma(p)$.

To assign values to nonterminal attributes it is often convenient to use production-valued attributes. The productions passed via these attributes are applied to the appropriate trees to produce the tree to be assigned to the nonterminal attribute. We thus extend the possible types of attributes to allow for production-valued attributes, so that $A_t(a) \in T_b \cup N \cup T \cup \Sigma(P)$.
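To make the shape of these definitions concrete, here is a minimal Haskell sketch of one possible encoding of productions with a forwards-to nonterminal. The representation (strings for nonterminal names, a flat child list, and the names Symbol and Production) is our illustration, not part of the paper's formalism:

-- A minimal sketch of one way to encode productions with forwarding;
-- the representation choices here are ours, not the paper's.
data Symbol = T Char | N String
  deriving Show

data Production = Production
  { lhs        :: String        -- X_0: the left hand side nonterminal
  , rhs        :: [Symbol]      -- X_1 .. X_m: standard children, built by parsing
  , ntAttrs    :: [String]      -- nonterminal attributes, built by semantic rules
  , forwardsTo :: Maybe String  -- optional forwards-to nonterminal (same sort as lhs)
  } deriving Show

-- The generic plus production from Section 1: two parsed Expr children,
-- a Type nonterminal attribute, and forwarding to an Expr.
plusProduction :: Production
plusProduction =
  Production "Expr" [N "Expr", N "Expr"] ["Type"] (Just "Expr")

main :: IO ()
main = print plusProduction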
For each production $p = X^p_0 ::= X^p_1 X^p_2 \ldots X^p_{m_p} X^p_{f_p} \in P$, we have a set of semantic rules $S(p)$ for computing values of the synthesised attributes of $X^p_0$, $a \in A_s(X^p_0)$, the inherited attributes of $X^p_i$, $a \in A_i(X^p_i)$, $1 \le i \le f_p$, and the nonterminal attributes $X^p_i$, $nta_p \le i \le f_p$. The set $A^e_s(p)$, $p \in P$, is the set of synthesised attributes defined explicitly by a semantic rule in $S(p)$ for nonterminal $X^p_0$, and $A^e_i(p, X^p_j)$, $p \in P$, $j \ge 1$, is the set of inherited attributes defined explicitly by a semantic rule in $S(p)$ for $X^p_j$. In standard (higher order) attribute grammars, it is required that $S(p)$ contains a semantic rule for each attribute $a \in A_s(X^p_0)$, each attribute $a \in A_i(X^p_i)$, $1 \le i \le m_p$, and each nonterminal attribute $X^p_i$, $nta_p \le i \le f_p$. That is, $A_s(X^p_0) = A^e_s(p)$ and $A_i(X^p_i) = A^e_i(p, X^p_i)$ for all $i$, $1 \le i \le f_p$. For productions using forwarding, we only require that all nonterminal attributes are explicitly defined by rules in $S(p)$. As we will see, synthesised attributes $a \in A_s(X^p_0)$ which are not explicitly defined receive as their value the value of $X^p_{f_p}.a$, and inherited attributes $a \in A_i(X^p_i)$, $1 \le i \le f_p$, which are not explicitly defined are not needed in the calculation of the synthesised attributes of $X^p_0$. A definedness test verifying that this condition holds is discussed in Section 4.
2.2 Attribute Evaluation
Attribute grammars with forwarding can be evaluated directly, as described here, or embedded into standard higher order attribute grammars and evaluated in that framework by traditional means, as described in Section 3. Attribute evaluation proceeds as it normally does for higher order attribute grammars, with the exception of synthesised attributes not explicitly defined by a production p ($a \in A_s(X^p_0) \setminus A^e_s(p)$, $p \in P$) and inherited attributes of the forwards-to nonterminal which are not explicitly defined ($a \in A_i(X^p_{f_p}) \setminus A^e_i(p, X^p_{f_p})$, $p \in P$). For synthesised attribute occurrences a on nonterminals $X^p_0$ defined by production p such that $a \in A_s(X^p_0) \setminus A^e_s(p)$, that is, those for which there is no defining semantic rule in $S(p)$, we will use the value $X^p_{f_p}.a$. That is, if $X^p_0$ is queried for its a attribute value, it returns $X^p_0.a$ if there is a semantic rule in $S(p)$ defining a ($a \in A^e_s(p)$); otherwise it returns $X^p_{f_p}.a$. For inherited attributes a not explicitly defined for $X^p_{f_p}$ by p, $a \in A_i(X^p_{f_p}) \setminus A^e_i(p, X^p_{f_p})$, we copy the values from $X^p_0$. The direct evaluation described here is particularly easy to implement by encoding the attribute grammar as a lazy functional program [15,1] and forms the basis of our prototype Intentional Programming system [30].
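This evaluation rule is easy to picture in a lazy functional setting. The following minimal Haskell sketch is our own illustration, under simplifying assumptions the paper does not make (string-valued attributes, no inherited attributes): a query for a synthesised attribute falls through to the forwards-to node when no explicit rule exists:

import           Data.Map (Map)
import qualified Data.Map as Map

-- A node carries its explicitly defined synthesised attributes and,
-- possibly, a forwards-to node.
data Node = Node
  { definedAttrs :: Map String String
  , fwdNode      :: Maybe Node
  }

-- Querying a synthesised attribute: use the explicit definition if one
-- exists in S(p); otherwise forward the query to the forwards-to node.
synAttr :: String -> Node -> Maybe String
synAttr a n =
  case Map.lookup a (definedAttrs n) of
    Just v  -> Just v
    Nothing -> fwdNode n >>= synAttr a

main :: IO ()
main = do
  let whileNode = Node (Map.fromList [("code", "<while code>")]) Nothing
      forNode   = Node (Map.fromList [("pp", "for i := ...")]) (Just whileNode)
  print (synAttr "pp"   forNode)  -- Just "for i := ..."  (explicit rule)
  print (synAttr "code" forNode)  -- Just "<while code>"  (via forwarding)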
3 Reduction to Higher Order Attribute Grammars
Forwarding enables the decomposition of an attribute grammar into separate aspects, which are fragments that define a group of related attributes [7,6]. Once all the aspects are known, and a complete grammar is woven from the pieces, forwarding can be eliminated. That is important both for the implementation and analysis of an attribute grammar. Much earlier work on efficient evaluation
can be used directly, and the tools for analysing attribute grammars need only be modified to trace potential errors to the source that used forwarding. This section is divided into three parts: first we show how production-valued attributes are eliminated, and then we demonstrate how forwarding itself can be transformed away. Finally we discuss how well the standard closure test, circularity test, and attribute evaluation strategies work with the reduced grammar. Some readers may find it helpful to study the formal descriptions below alongside the concrete example in Section 1.
3.1 Elimination of Production-Valued Attributes
As mentioned before, our technique for eliminating production-valued attributes is very similar to the defunctionalisation of higher-order programs. That transformation was first proposed and studied by Reynolds [24]; see also [5]. It is a whole-program transformation where function types are replaced by an enumeration of the function abstractions in the program. Here, we introduce an enumeration type for all production names. Next, we replace each production-valued attribute attr with an enumeration-valued attribute attr_pn generated from the names of the intended productions. Furthermore, we replace each reference to a production-valued attribute and its application to trees t1, t2, ..., tn

  X_j^p.attr(t1, t2, ..., tn)

by the expression

  case X_j^p.attr_pn of
    attr_p1 → p1 (t1, t2, ..., tn)
    attr_p2 → p2 (t1, t2, ..., tn)
    ...
    attr_pm → pm (t1, t2, ..., tn)

where attr_pi, 1 ≤ i ≤ m, is the enumeration token value for production pi with σ(pi) = At(a), pi ∈ P. As the defunctionalisation transformation is well known, and this is a particularly simple instance, we confine its exposition to this brief sketch.
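For readers unfamiliar with defunctionalisation, the following small Haskell sketch shows the idea on the plus example. The names and types are ours, and only the function-to-token step is shown, not the full grammar transformation:

-- Abstract syntax for the running example.
data Expr = Num Int | Add Expr Expr | Concat Expr Expr
  deriving Show

-- Higher-order version: the attribute plusProd holds a production,
-- i.e. a function from child trees to a new tree.
plusProd :: Bool -> (Expr -> Expr -> Expr)
plusProd isInteger = if isInteger then Add else Concat

-- Defunctionalised version: one enumeration token per production ...
data PlusProdToken = TokAdd | TokConcat
  deriving Show

plusProdToken :: Bool -> PlusProdToken
plusProdToken isInteger = if isInteger then TokAdd else TokConcat

-- ... and a single apply function that dispatches on the token,
-- mirroring the case expression in the text.
applyPlusProd :: PlusProdToken -> Expr -> Expr -> Expr
applyPlusProd TokAdd    l r = Add l r
applyPlusProd TokConcat l r = Concat l r

main :: IO ()
main = do
  print (plusProd True (Num 1) (Num 2))                      -- Add (Num 1) (Num 2)
  print (applyPlusProd (plusProdToken True) (Num 1) (Num 2)) -- Add (Num 1) (Num 2)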
3.2 Elimination of Forwarding
Our starting point is an attribute grammar with forwarding, as defined in Section 2.1. The forwarding is eliminated in two steps, with a third, optional step that is needed for the result to be acceptable to many attribute grammar systems.

1. Add semantic rules to explicitly copy synthesised attribute values from the forwarding nonterminals to the left hand side nonterminals. For each forwarding production $p \in P$ and for each attribute $a \in A_s(X^p_0) \setminus A^e_s(p)$, add the following semantic rule:

   $X^p_0.a = X^p_{f_p}.a$
   That is, for each synthesised attribute a that is declared to annotate the left hand side of p ($a \in A_s(X^p_0)$) but is not one of the attributes explicitly defined by p ($a \notin A^e_s(p)$), add the above semantic rule to p.

2. Add semantic rules to explicitly copy inherited attribute values from the left hand side nonterminals to the forwarding nonterminals. For each forwarding production $p \in P$ and for each attribute $a \in A_i(X^p_{f_p}) \setminus A^e_i(p, X^p_{f_p})$, add the following semantic rule:

   $X^p_{f_p}.a = X^p_0.a$

   That is, for each inherited attribute a that is declared to annotate the forwards-to nonterminal $X^p_{f_p}$ of p ($a \in A_i(X^p_{f_p})$) but is not one of the attributes explicitly defined by p ($a \notin A^e_i(p, X^p_{f_p})$), add the above semantic rule to p.

3. Add semantic rules for undefined inherited attributes. This step is optional and is only necessary to force the reduced higher order attribute grammar definition to pass the standard closure tests. It does not affect the evaluation of the attribute grammar. For each forwarding production $p \in P$ and for each attribute $a \in A_i(X^p_j) \setminus A^e_i(p, X^p_j)$, $1 \le j \le m_p$, add the following semantic rule:

   $X^p_j.a = \alpha_{A_t(a)}$

   where $\alpha_{A_t(a)}$ is any value of type $A_t(a)$. That is, for each inherited attribute a that is declared to annotate the nonterminal $X^p_j$, $1 \le j \le m_p$, of p ($a \in A_i(X^p_j)$) but is not one of the attributes explicitly defined by p ($a \notin A^e_i(p, X^p_j)$), add the above semantic rule to p.

Similar mechanisms for the automatic generation of copy rules first came to our attention when studying the micro attribute grammar system produced by Swierstra and his colleagues at Utrecht [28]. That system does not, however, provide forwarding.
3.3 Closure, Circularity and Attribute Evaluation
Once the attribute grammar with forwarding has been reduced we can apply the standard closure and circularity tests and use existing mechanisms for attribute evaluation. We have, in fact, developed a simple prototype which uses the process described above to reduce a grammar with forwarding to a standard higher order attribute grammar written in SSL, the attribute grammar definition language of the Synthesizer Generator [23]. This allows us to use this tool’s analysis tests and attribute evaluation implementation. Attribute grammars are typically checked for definedness in two phases. The first phase, known as the closure test, checks that no semantic rules are missing. For example, if somewhere in the grammar the synthesised attribute code from a subtree of type Stmt is used, then every production with Stmt on its left hand side must provide a semantic function for code. The second phase, known as the circularity test, checks whether there is an input tree on which the attributes are circularly defined. We can safely apply the circularity test to the reduced
grammar, since circularities in the original grammar will also be detected as circularities in the reduced grammar. However, there are problems with the closure test. Although we can force the reduced grammar to pass the standard closure test (step 3 above), it then fails to detect genuine missing definitions. Also, the Synthesizer Generator's strict evaluation strategy on the reduced grammar will cause the unnecessary evaluation of unused attributes. (One can, however, specify attributes to be evaluated on demand in SSL.) The root of both problems is that a production can define values for the left hand side nonterminal either explicitly or implicitly via forwarding.

Consider using forwarding to define a for loop in terms of the expected while loop as defined below. Here we have used quoting and implicit anti-quoting functions in order to specify the forwards-to construct using its concrete syntax instead of the abstract syntax tree constructors as we've done before.

for : Stmt ::= id Expr Expr Stmt
  Stmt1.pp = gen_for_pp(id1, Expr1.pp, Expr2.pp, Stmt2.pp)
  forwardsTo parse “id1 := Expr1 ;
                    while ( id1 ≤ Expr2 ) do
                      Stmt2 ;
                      id1 := id1 + 1
                    endwhile”

Except for the pretty print attribute, the semantics of for are determined entirely by forwarding. The efficiency problem can be seen by considering a strategy which evaluates all attribute occurrences in the tree. Such a strategy would unnecessarily compute the code attribute for the nodes in the child trees of for and the pretty print attribute for the nodes in the forwards-to tree. In contrast, demand-driven evaluation would only evaluate those attribute definition functions which are necessary.

The problems with the closure test are more subtle. Consider a break statement defined as follows:

break : Stmt ::= ε
  code = goto Stmt1.gotoLabel

The inherited attribute gotoLabel is defined by the while production for its Stmt child, and other productions have semantic functions to copy this value to their Stmt children. By using forwarding, the break statement works as expected when it appears inside a for loop, since the code attribute of for is defined by forwarding to a while loop construct. The for writer doesn't need to define, or even know about, the gotoLabel attribute, and the for writer should define neither the code nor the gotoLabel attribute. This attribute represents the type of detailed semantic information the writer of the for construct should not need to know about.

The subtlety arises in the case when the for production explicitly defines the code attribute in terms of the code attribute of its children (perhaps in an attempt to generate more efficient code than that generated by the translation into a while loop) but doesn't define the gotoLabel attribute for its Stmt children. Since any break inside the for will need a gotoLabel value, this attribute should be defined by the for production. If we evaluate the attribute grammar with forwarding
as discussed in Section 2.2, the evaluation fails when the Stmt child of the for attempts to reference its gotoLabel value, which should be, but isn't, defined for it by the for production; this causes a compile-time exception. If we reduce the grammar as described above, step 3 adds an incorrect definition of gotoLabel, and the grammar passes the standard closure test but generates incorrect results. Clearly, the for loop must define either both the code and gotoLabel attributes or neither of them. Next, we describe a definedness test which can identify this type of error.
4 The Definedness Test
As we explained above, although the standard circularity test can be applied to the reduced grammar, the standard closure test is inaccurate. We propose that this problem can be solved by abandoning the closure test and modifying the circularity test so that it encompasses both roles. It statically checks that all required attributes are well defined: that is, that they have definitions and that these definitions are non-circular.

The standard circularity test [18] operates by computing a set of dependency relations for each nonterminal in the grammar. A dependency relation is a property of an individual abstract syntax tree which relates the root node's synthesised attributes to its inherited attributes. A synthesised attribute s is related to the inherited attribute i if the computation of s depends on the value of i. Different abstract syntax trees can have different dependency relations, even if they are of the same nonterminal type. Therefore, the circularity test computes for each nonterminal the set of all dependency relations that a tree of that type might have. Since the set of possible relations is finite, there is an algorithm which can compute them in a finite amount of time without examining every possible tree. During the process of computing these sets, it may discover that an abstract syntax tree exists in which the attributes are circularly defined.

Our definedness test replaces dependency relations with definedness functions. A definedness function is also a property of a particular tree and has the type Set A_i → Set A_s, where A_i (A_s) is the set of all inherited (synthesised) attributes. The function states which synthesised attributes can be computed on the tree's root node if only the given set of inherited attributes is defined on the root. Consider an example definedness function w.

– For s ∈ A_s, if s ∈ w(∅), then s must have a constant value, because it does not depend on any of the inherited attributes.
– If s ∈ w(I) for some I ⊆ A_i with i ∉ I, then s does not depend on i.
– If s ∉ w(A_i), s ∈ A_s, then either the semantic rule for s is missing or s depends on a circular computation.

Definedness functions are very similar to the dependency relations, except that they operate in the opposite direction: they are given a set of inherited attributes and report which synthesised attributes can be computed. The advantage of the definedness function is that it incorporates closure as well as circularity;
for a particular tree, if the semantic function is missing for an attribute in the tree whose value is required to compute a synthesised attribute s, then s will not appear in the output of w for any given input. This is exactly the type of information we need to detect the missing gotoLabel semantic function when for explicitly defines the code attribute in the example above. A disadvantage of the definedness function is that it does not distinguish between circularity and closure: if s does not appear in the output of w, this could be due to a missing semantic rule or because s depends on a circular computation. However, in both of these cases the grammar is ill-defined, so we do not see this as a major drawback.

The algorithm for the definedness test is very similar to the circularity test. The test produces a set of definedness functions for every nonterminal in the grammar. Again, since the set of possible definedness functions is finite, our algorithm uses the same technique as the circularity test to compute them in a finite amount of time without needing to examine every possible tree. A more complete description and a proof of correctness are given in a forthcoming paper by Backhouse [2].

We must note that neither the standard circularity test nor the definedness test catches a particular kind of non-termination error. It is possible to construct an infinitely large abstract syntax tree by unbounded nesting of nonterminal attributes, but no exact static test can detect this type of error.
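To make the shape of a definedness function concrete, here is a toy Haskell sketch of w for a single hypothetical tree. The dependency structure (code requires gotoLabel, pp requires nothing) mirrors the for-loop example, but the string-based encoding is ours and this is not the paper's fixpoint algorithm:

import           Data.Set (Set)
import qualified Data.Set as Set

-- A definedness function for one particular tree: given the set of
-- inherited attributes defined at the root, report which synthesised
-- attributes can be computed there.
w :: Set String -> Set String
w inh = Set.fromList
  (  [ "pp" ]                                     -- constant: needs no inherited attribute
  ++ [ "code" | "gotoLabel" `Set.member` inh ] )  -- code requires gotoLabel

main :: IO ()
main = do
  print (w Set.empty)                     -- fromList ["pp"]
  print (w (Set.fromList ["gotoLabel"]))  -- fromList ["code","pp"]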
5 Conclusion
We have introduced forwarding as a technique for the modular decomposition of higher-order attribute grammars. The technique is orthogonal to other features for the modular description of programming languages. Furthermore, we have demonstrated how a whole-grammar transformation can eliminate the use of forwarding altogether. Production-valued attributes are convenient in conjunction with forwarding; their use can also be transformed away. We noted the connection with defunctionalisation, and indeed it would be of interest to see whether the elimination of forwarding itself can also be understood in those terms. If so, it would lend further credence to our belief that there is much benefit to be derived from the interaction between the functional programming and attribute grammar communities. In a separate paper, one of us (Backhouse) has shown how abstract interpretation can benefit the study of attribute grammars [2]. Conversely, Correnson, Parigot, and their coworkers have argued that transformations on attribute grammars benefit functional programs [4,22].
Acknowledgements Eric Van Wyk and Kevin Backhouse are supported by a grant from Microsoft Research. We would like to thank our colleagues both at Microsoft and at Oxford for many interesting discussions on the topic of forwarding.
References

1. S. Adams. Modular Attribute Grammars for Programming Language Prototyping. PhD thesis, University of Southampton, 1991.
2. K. S. Backhouse. A functional semantics of attribute grammars. In International Conference on Tools and Algorithms for Construction and Analysis of Systems, Lecture Notes in Computer Science. Springer-Verlag, 2002.
3. A. Carle. Hierarchical attribute grammars: Dialects, applications and evaluation algorithms. Technical Report TR93-270, Department of Computer Science, Rice University, 1993.
4. L. Correnson, E. Duris, D. Parigot, and G. Roussel. Declarative program transformation: a deforestation case-study. In G. Nadathur, editor, Principles and Practice of Declarative Programming, volume 1702 of Lecture Notes in Computer Science, pages 353–369. Springer-Verlag, 1999.
5. O. Danvy and L. R. Nielsen. Defunctionalization at work. In Third International Conference on Principles and Practice of Declarative Programming (PPDP '01). ACM Press, 2001.
6. O. de Moor, K. Backhouse, and S. D. Swierstra. First-class attribute grammars. Informatica, 24(3), 2000.
7. O. de Moor, S. Peyton Jones, and E. Van Wyk. Aspect-oriented compilers. In First International Symposium on Generative and Component-Based Software Engineering, Lecture Notes in Computer Science. Springer-Verlag, 1999.
8. G. D. P. Dueck and G. V. Cormack. Modular attribute grammars. The Computer Journal, 33:164–172, 1990.
9. D. R. Engler, W. C. Hsieh, and M. F. Kaashoek. `C: A language for high-level, efficient, and machine-independent dynamic code generation. In Symposium on Principles of Programming Languages, pages 131–144, 1996.
10. R. Farrow, T. J. Marlowe, and D. M. Yellin. Composable attribute grammars: Support for modularity in translator design and implementation. In Proceedings of the ACM Symposium on Principles of Programming Languages, pages 223–234. ACM Press, 1992.
11. H. Ganzinger and R. Giegerich. Attribute coupled grammars. SIGPLAN Notices, 19:157–170, 1984.
12. J. A. Goguen. Higher-order functions considered unnecessary for higher-order programming. In D. A. Turner, editor, Research Topics in Functional Programming, pages 309–351. Addison-Wesley, Reading, MA, 1990.
13. G. Hedin. An object-oriented notation for attribute grammars. In Proceedings of the European Conference on Object-Oriented Programming, ECOOP '89. Cambridge University Press, 1989.
14. G. Hedin. Reference attributed grammars. In D. Parigot and M. Mernik, editors, Second Workshop on Attribute Grammars and their Applications, WAGA '99, pages 153–172, Amsterdam, The Netherlands, 1999. INRIA Rocquencourt.
15. T. Johnsson. Attribute grammars as a functional programming paradigm. In G. Kahn, editor, Functional Programming Languages and Computer Architecture, volume 274 of Lecture Notes in Computer Science, pages 154–173. Springer-Verlag, 1987.
16. S. Peyton Jones and J. Hughes, editors. Haskell 98: A non-strict, purely functional language.
17. U. Kastens and W. M. Waite. Modularity and reusability in attribute grammars. Acta Informatica, 31:601–627, 1994.
18. D. E. Knuth. Semantics of context-free languages. Mathematical Systems Theory, 2(2):127–146, 1968. Corrections in 5(2):95–96, 1971.
19. C. Le Bellec, M. Jourdan, D. Parigot, and G. Roussel. Specification and implementation of grammar coupling using attribute grammars. In M. Bruynooghe and J. Penjam, editors, Programming Language Implementation and Logic Programming (PLILP '93), volume 714 of Lecture Notes in Computer Science, pages 123–136. Springer-Verlag, 1993.
20. W. Maddox. Semantically-sensitive macroprocessing. Master's thesis, University of California at Berkeley, Computer Science Division (EECS), Berkeley, CA 94720, December 1989.
21. M. Mernik, M. Lenic, E. Avdicausevic, and V. Zumer. Multiple attribute grammar inheritance. Informatica, 24(3):319–328, 2000.
22. D. Parigot, E. Duris, G. Roussel, and M. Jourdan. Attribute grammars: a declarative functional language. Rapport de Recherche 2662, INRIA, 1995.
23. T. W. Reps and T. Teitelbaum. The Synthesizer Generator: A System for Constructing Language-Based Editors. Texts and Monographs in Computer Science. Springer-Verlag, 1989.
24. J. C. Reynolds. Definitional interpreters for higher-order programming languages. Higher-Order and Symbolic Computation, 11(4):363–397, 1998. Reprinted from the proceedings of the 25th ACM National Conference (1972).
25. J. Saraiva and S. D. Swierstra. Generic attribute grammars. In D. Parigot and M. Mernik, editors, Second Workshop on Attribute Grammars and their Applications, WAGA '99, pages 185–204, Amsterdam, The Netherlands, 1999. INRIA Rocquencourt.
26. T. Sheard. Using MetaML: A staged programming language. In Advanced Functional Programming, pages 207–239, 1998.
27. C. Simonyi. Intentional programming: Innovation in the legacy age. Presented at IFIP Working Group 2.1. Available from http://www.research.microsoft.com/research/ip/, 1996.
28. S. D. Swierstra. Simple, functional attribute grammars. http://www.cs.uu.nl/groups/ST/Software/UU_AG/, 1999.
29. T. Teitelbaum and R. Chapman. Higher-order attribute grammars and editing environments. In ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, pages 197–208, 1990.
30. E. Van Wyk, O. de Moor, G. Sittampalam, I. Sanabria-Piretti, K. Backhouse, and P. Kwiatkowski. Intentional programming: a host of language features. Technical Report PRG-RR-01-15, Computing Laboratory, University of Oxford, 2001.
31. H. Vogt. Higher order attribute grammars. PhD thesis, Department of Computer Science, Utrecht University, The Netherlands, 1989.
32. H. Vogt, S. D. Swierstra, and M. F. Kuiper. Higher-order attribute grammars. In Conference on Programming Language Design and Implementation, pages 131–145, 1990. Published as ACM SIGPLAN Notices, 24(7).
33. D. Weise and R. F. Crew. Programmable syntax macros. ACM SIGPLAN Notices, 28(6):156–165, 1993.
Disambiguation Filters for Scannerless Generalized LR Parsers

Mark G. J. van den Brand (1,4), Jeroen Scheerder (2), Jurgen J. Vinju (1), and Eelco Visser (3)
1 Centrum voor Wiskunde en Informatica (CWI), Kruislaan 413, 1098 SJ Amsterdam, The Netherlands, {Mark.van.den.Brand,Jurgen.Vinju}@cwi.nl
2 Department of Philosophy, Utrecht University, Heidelberglaan 8, 3584 CS Utrecht, The Netherlands, [email protected]
3 Institute of Information and Computing Sciences, Utrecht University, P.O. Box 80089, 3508 TB Utrecht, The Netherlands, [email protected]
4 LORIA-INRIA, 615 rue du Jardin Botanique, BP 101, F-54602 Villers-lès-Nancy Cedex, France
Abstract. In this paper we present the fusion of generalized LR parsing and scannerless parsing. This combination supports syntax definitions in which all aspects (lexical and context-free) of the syntax of a language are defined explicitly in one formalism. Furthermore, there are no restrictions on the class of grammars, thus allowing a natural syntax tree structure. Ambiguities that arise through the use of unrestricted grammars are handled by explicit disambiguation constructs, instead of implicit defaults that are taken by traditional scanner and parser generators. Hence, a syntax definition becomes a full declarative description of a language. Scannerless generalized LR parsing is a viable technique that has been applied in various industrial and academic projects.
1 Introduction
Since the introduction of efficient deterministic parsing techniques, parsing is considered a closed topic for research, both by computer scientists and by practitioners in compiler construction. Tools based on deterministic parsing algorithms such as LEX & YACC [15,11] (LALR) and JavaCC (recursive descent) are considered adequate for dealing with almost all modern (programming) languages. However, the development of more powerful parsing techniques is prompted by domains such as reverse engineering and domain-specific languages. The field of reverse engineering is concerned with automatically analyzing legacy software and producing specifications, documentation, or reimplementations. This area provides numerous examples of parsing problems that can only be tackled by using powerful parsing techniques.
Grammars of languages such as Cobol, PL1, Fortran, etc. are not naturally LALR. Much massaging and default resolution of conflicts are needed to implement a parser for these languages in YACC. Maintenance of such massaged grammars is a pain since changing or adding a few productions can lead to new conflicts. This problem is aggravated when different dialects need to be supported—many vendors implement their own Cobol dialect. Since grammar formalisms are not modular this usually leads to forking of grammars. Further trouble is caused by the embedding of ‘foreign’ language fragments, e.g., assembler code, SQL, CICS, or C, which is common practice in Cobol programs. Merging of grammars for several languages leads to conflicts at the context-free grammar level and at the lexical analysis level. These are just a few examples of problems encountered with deterministic parsing techniques. The need to tackle such problems in the area of reverse engineering has led to a revival of generalized parsing algorithms such as Earley’s algorithm, (variants of) Tomita’s algorithm (GLR) [14,21,17,2,20], and even recursive descent backtrack parsing [6]. Although generalized parsing solves several problems in this area, generalized parsing alone is not enough. In this paper we describe the benefits and the practical applicability of scannerless generalized LR parsing. In Section 2 we discuss the merits of scannerless parsing and generalized parsing and argue that their combination provides a solution for problems like the ones described above. In Section 3 we describe how disambiguation can be separated from grammar structure, thus allowing a natural grammar structure and declarative and selective specification of disambiguation. In Section 4 we discuss issues in the implementation of disambiguation. In Section 5 practical experience with the parsing technique is discussed. In Section 6 we present figures on the performance of our implementation of a scannerless generalized parser. Related work is discussed where needed throughout the paper. Finally, we conclude in Section 7.
2 Scannerless Generalized Parsing

2.1 Generalized Parsing
Generalized parsers are a class of parsing algorithms that are not constrained by restrictions on the class of grammars that they can handle, contrary to restricted parsing algorithms such as the various derivatives of the LL and LR algorithms. Whereas these algorithms only deal with context-free grammars in LL(k) or LR(k) form, generalized algorithms such as Earley’s or Tomita’s algorithms can deal with arbitrary context-free grammars. There are two major advantages to the use of arbitrary context-free grammars. Firstly, the class of context-free grammars is closed under union, in contrast with all proper subclasses of context-free grammars. For example, the composition of two LALR grammars is very often not an LALR grammar. The compositionality of context-free grammars opens up the possibility of developing modular syntax definition formalisms. Modularity in programming languages
and other formalisms is one of the key beneficial software engineering concepts. A striking example in which modularity of a grammar is obviously practical is the definition of hybrid languages such as Cobol with CICS, or C with assembly. Sdf [10,23] is an example of a modular syntax definition formalism. Secondly, an arbitrary context-free grammar allows the definition of declarative grammars. There is no need to massage the grammar into LL, LR, LALR, or any other form. Rather, the grammar can reflect the intended structure of the language, resulting in a concise and readable syntax definition. Thus, the same grammar can be used for documentation as well as implementation of a language without any changes. Since generalized parsers can deal with arbitrary grammars, they can also deal with ambiguous grammars. While a deterministic parser produces a single parse tree, a non-deterministic parser produces a collection (forest) of trees compactly representing all possible derivations according to the grammar. This can be helpful when developing a grammar for a language. The parse forest can be used to visualize the ambiguities in the grammar, thus aiding in the improvement of the grammar. Contrast this with solving conflicts in an LALR table. Disambiguation filters can be used to reduce a forest to the intended parse tree. Filters can be based on disambiguation rules such as priority and associativity declarations. Such filters solve the most frequent ambiguities in a natural and intuitive way without hampering the clear structure of the grammar. In short, generalized parsing opens up the possibility of developing clear and concise language definitions, separating the language design problem from the disambiguation problem.
2.2 Scannerless Parsing
Traditionally, syntax analysis is divided into a lexical scanner and a (context-free) parser. A scanner divides an input string consisting of characters into a string of tokens. This tokenization is usually based on regular expression matching. To choose between overlapping matches a number of standard lexical disambiguation rules are used. Typical examples are prefer keywords, prefer longest match, and prefer non-layout. After tokenization, the tokens are typically interpreted by the parser as the terminal symbols of an LR(1) grammar. Although this architecture proves to be practical in many cases and is globally accepted as the standard solution for parser generation, it has some problematic limitations. Only a few existing programming languages are designed to fit this architecture, since these languages generally have an ambiguous lexical syntax. The following examples illustrate this misfit for Cobol, PL1, and Pascal. In an embedded language, such as SQL in Cobol, identifiers that are reserved keywords in Cobol might be allowed inside SQL statements. However, the implicit “prefer keywords” rule of lexical scanners will automatically prohibit them in SQL too. Another Cobol example: a particular “picture clause” might look like "PIC 99", where "99" should be recognized as a list of picchars. In some other part of a Cobol program, the number "99" should be recognized as numeric. Both
lexical categories obviously overlap, but on the context-free level there is no ambiguity, because picture clauses do not appear where numerics do. See [13] for a Cobol syntax definition. Another example of scanner and parser interference stems from Pascal. Consider the input sentence "array [1..10] of integer". The range "1..10" can be tokenized in two different manners: either as the real "1." followed by the real ".10", or as the integer "1" followed by the range operator ".." followed by the integer "10". In order to come up with the correct tokenization the scanner must “know” it is processing an array declaration. The problem is even more pressing when a language does not have reserved keywords at all. PL1 is such a language. This means that a straightforward tokenization is not possible when scanning a valid PL1 sentence such as "IF THEN THEN = ELSE; ELSE ELSE = THEN;". Similar examples can be found for almost any existing programming language. A number of techniques for tackling this problem are discussed in [3]. Some parser generators provide a complex interface between scanner and parser in order to profit from the speed of lexical analysis while using the power of a parser. Some lexical scanners have more expressive means than regular expressions to be able to make more detailed decisions. Some parser implementations allow arbitrary computations to be expressed in a programming language such as C to guide the scanner and the parser. All in all, it is rather cumbersome to develop and to maintain grammars which have to solve such simple lexical disambiguations, because none of these approaches results in declarative syntax specifications. Scannerless parsing is an alternative parsing technique that does not suffer from these problems. The term scannerless parsing was introduced in [18,19] to indicate parsing without a separate lexical analysis phase. In scannerless parsing, a syntax definition is a context-free grammar with characters as terminals. Such an integrated syntax definition defines all syntactic aspects of a language, including the full details of the lexical syntax. The parser derived from this grammar directly reads the characters of the input string and finds its phrase structure. Scannerless parsing does not suffer from the problems of implicit lexical disambiguation. Very often the problematic lexical ambiguities do not even exist at the context-free level, as is the case in our Cobol, Pascal, and PL1 examples. On the other hand, the lack of implicit rules such as “prefer keywords” and “longest match” might give rise to new ambiguities at the context-free level. These ambiguities can be solved by providing explicit declarative rules in a syntax definition language. Making such disambiguation decisions explicit makes it possible to apply them selectively. For instance, we could specify longest match for a single specific sort, instead of for the entire grammar, as we shall see in Section 3. In short, scannerless parsing does not need to make any assumptions about the lexical syntax of a language and is therefore more generically applicable for language engineering.
2.3 Combining Scannerless Parsing and Generalized Parsing
Syntax definitions in which lexical and context-free syntax are fully integrated do not usually fit in any restricted class of grammars required by deterministic parsing techniques, because lexical syntax often requires arbitrary-length lookahead. Therefore, scannerless parsing does not go well with deterministic parsing. For this reason the adjacency restrictions and exclusion rules of [18,19] could only be partly implemented in an extension of an SLR(1) parser generator and led to complicated grammars. Generalized parsing techniques, on the other hand, can deal with arbitrary-length lookahead. Using a generalized parsing technique solves the problem of lexical lookahead in scannerless parsing. However, it requires a solution for disambiguation of lexical ambiguities that are not resolved by the parsing context. In the rest of this paper we describe how syntax definitions can be disambiguated by means of declarative disambiguation rules for several classes of ambiguities, in particular lexical ambiguities. Furthermore, we discuss how these disambiguation rules can be implemented efficiently.
3 Disambiguation Rules
There are many ways to disambiguate ambiguous grammars, ranging from simple syntactic criteria to semantic criteria [12]. Here we concentrate on ambiguities caused by integrating lexical and context-free syntax. Four classes of disambiguation rules turn out to be adequate. Follow restrictions are a simplification of the adjacency restriction rules of [18,19] and are used to achieve longest-match disambiguation. Reject productions, called exclusion rules in [18,19], are designed to implement reserved-keyword disambiguation. Priority and associativity rules are used to disambiguate expression syntax. Preference attributes are used for selecting a default among several alternative derivations.
3.1 Follow Restrictions
Suppose we have the simple context-free grammar for terms as presented in Figure 1. An Id is defined to be one or more characters from the class [a-z]+ and two terms are separated by whitespace consisting of zero or more spaces or newlines. Without any lexical disambiguation, this grammar is ambiguous. For example, the sentence "hi" can be parsed as Term(Id("hi")) or as Term(Id("h")), Ws(""), Term(Id("i")). Assuming the first is the intended derivation, we add a follow restriction, Id -/- [a-z], indicating that an Id may not be directly followed by a character in the range [a-z]. This entails that such a character should be part of the identifier. Similarly, follow restrictions are added for Nat and Ws. We have now specified a longest match for each of these lexical constructs.
Term ::= Id | Nat | Term Ws Term
Id   ::= [a-z]+
Nat  ::= [0-9]+
Ws   ::= [~\n]*
%restrictions
Id  -/- [a-z]
Nat -/- [0-9]
Ws  -/- [~\n]
Fig. 1. Term language with follow restrictions

In some languages it is necessary to have more than one character of lookahead to decide the follow restriction. In Figure 2 we extend the layout definition of Figure 1 with comments. The expression ~[\*] indicates any character except the asterisk. The expression [\(].[\*] defines a restriction on two consecutive characters. The result is a longest match for the Ws nonterminal, including comments. The follow restriction on Star prohibits the recognition of the string "*)" within Comment. Note that it is straightforward to extend this definition to deal with nested comments.

Star        ::= [\*]
CommentChar ::= ~[\*] | Star
Comment     ::= "(*" CommentChar* "*)"
Ws          ::= ([~\n] | Comment)*
%restrictions
Star -/- [\)]
Ws   -/- [~\n] | [\(].[\*]
Fig. 2. Extended layout definition with follow restrictions
3.2 Reject Productions
Reject productions are used to implement keyword reservation. We extend the grammar definition of Figure 1 with the begin and end construction in Figure 3. The sentence "begin hi end" is either interpreted as three consecutive Id terms separated by Ws, or as a Program with a single term hi. By rejecting the strings begin and end from Id, the first interpretation can be filtered out. The reject mechanism can be used to reject not only strings, but entire context-free languages from a nonterminal. We focus on its use for keyword reservation in this paper and refer to [23] for more discussion.

Program ::= "begin" Ws Term Ws "end"
Id      ::= "begin" | "end" {reject}
Fig. 3. Prefer keywords using reject productions
3.3 Priority and Associativity
For completeness we show an example of the use of priority and associativity in an expression language. Note that we have left out the Ws nonterminal for brevity. (By doing grammar normalization, a parse table generator can automatically insert layout between the members in the right-hand side; see also Section 5.) In Figure 4 we see that the binary operators + and * are both defined as left associative and the * operator has a higher priority than the + operator. Consequently the sentence "1 + 2 + 3 * 4" is interpreted as "(1 + 2) + (3 * 4)".

Exp ::= [0-9]+
Exp ::= Exp "+" Exp {left}
Exp ::= Exp "*" Exp {left}
%priorities
Exp ::= Exp "*" Exp > Exp ::= Exp "+" Exp
Fig. 4. Associativity and priority rules
3.4 Preference Attributes
A preference rule is a generally applicable rule to choose a default among ambiguous parse trees. For example, it can be used to disambiguate the notorious dangling else construction. Again we have left out the Ws nonterminal for brevity. In Figure 5 we extend our term language with this construct. The input sentence "if 0 then if 1 then hi else ho" can be parsed in two ways: if 0 then (if 1 then hi) else ho and if 0 then (if 1 then hi else ho). We can select the latter derivation by adding the prefer attribute to the production without the else part. The parser will still construct an ambiguity node containing both derivations, namely if 0 then (if 1 then hi {prefer}) else ho and if 0 then (if 1 then hi else ho) {prefer}. But given the fact that the top node of the latter derivation tree has the prefer attribute, this derivation is selected and the other tree is removed from the ambiguity node.

Term ::= "if" Nat "then" Term {prefer}
Term ::= "if" Nat "then" Term "else" Term
Id   ::= "if" | "then" | "else" {reject}

Fig. 5. Dangling else construction disambiguated

The dual of {prefer} is the {avoid} attribute. Any other tree is preferred over a tree with an avoided top production. One of its uses is to prefer keywords rather than reserving them entirely. For example, we can add an {avoid} to the Id ::= [a-z]+ production in Figure 1 and not add the reject productions of Figure 3. The sentence "begin begin end" is now a valid Program with the single derivation of a Program containing the single Id "begin".
4 Implementation Issues
Our implementation of scannerless generalized parsing consists of the syntax definition formalism Sdf, which supports concise specification of integrated syntax definitions, a grammar normalizer that injects layout and desugars regular expressions, a parse table generator, and a parser that interprets parse tables. The parser is based on the GLR algorithm. For the basic GLR algorithms we refer to the first publication on generalized LR parsing by Lang [14], the work by Tomita [21], and the various improvements and implementations [17,2,20]. We will not present the complete SGLR algorithm, because it is essentially the standard GLR algorithm where each character is a separate token. (The current implementation of SGLR supports the Latin-1 character set.) For a detailed description of the implementation of GLR and SGLR we refer to [17] and [22] respectively.

The algorithmic differences between standard GLR and scannerless GLR parsing are centered around the disambiguation constructs. From a declarative point of view each disambiguation rule corresponds to a filter that prunes parse forests. In this view, parse table generation and the GLR algorithm remain unchanged and the parser returns a forest containing all derivations. After parsing, a number of filters is executed and a single tree, or at least a smaller forest, is obtained. Although this view is conceptually attractive, it does not fully exploit the possibilities for pruning the parse forest before it is even created. A filter might be implemented statically, during parse table generation, dynamically, during parsing, or after parsing. The sooner a filter is applied, the faster a parser will return the filtered derivation tree. In which phase a filter is applicable depends on the particulars of the specific disambiguation rule. In this section we discuss the implementation of the four classes of disambiguation rules.
4.1 Follow Restrictions
Our parser generator generates a simple SLR(1) parse table; however, we deviate in a number of places from the standard algorithm [1]. One modification is the calculation of the follow set. The follow set is calculated for each individual production rule instead of for each nonterminal. Using priority and associativity relations may lead to different follow sets for productions with the same nonterminal on the left-hand side. Another modification is that the transitions between states (item-sets) in the LR-automaton are not labeled with a nonterminal, but with a production rule. These more fine-grained transitions increase the size of the LR-automaton, but they allow us to generate parse tables with fewer conflicts. Follow restriction declarations with a single lookahead can be used during parse table generation to remove reductions from the parse table. This is done
by intersecting the follow set of each production rule with the set of characters in the follow restrictions for the produced nonterminal. The effect of this filter is that the reduction in question cannot be performed for characters in the follow restriction set. Restrictions with more than one character of lookahead must be dealt with dynamically by the parser. The parse table generator marks the reductions that produce a nonterminal that has restrictions with more than one character. Then, while parsing, before such a reduction is done, the parser must retrieve the required number of characters from the string and check them against the restrictions. If the next characters in the input match these restrictions the reduction is not allowed; otherwise it can be performed. This parse-time implementation prevents shift/reduce conflicts that would normally occur and therefore saves the parser from performing unnecessary work.
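The static, single-lookahead part of this filter can be pictured with a small Haskell sketch. The grammar data below (a follow set and the restriction Id -/- [a-z]) is invented for illustration and is not taken from an actual parse table:

import           Data.Map (Map)
import qualified Data.Map as Map
import           Data.Set (Set)
import qualified Data.Set as Set

type NT = String

-- Hypothetical follow set for productions of the nonterminal Id.
followSets :: Map NT (Set Char)
followSets = Map.fromList [("Id", Set.fromList "abz +)")]

-- Declared follow restrictions, e.g. Id -/- [a-z] from Figure 1.
followRestrictions :: Map NT (Set Char)
followRestrictions = Map.fromList [("Id", Set.fromList ['a'..'z'])]

-- Static filter: a reduce action survives only for lookahead
-- characters that the follow restriction does not exclude.
prunedFollow :: NT -> Set Char
prunedFollow nt =
  Map.findWithDefault Set.empty nt followSets
    `Set.difference` Map.findWithDefault Set.empty nt followRestrictions

main :: IO ()
main = print (prunedFollow "Id")  -- fromList " )+" : the letters are gone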
4.2 Reject Productions
Disambiguation by means of reject productions cannot be implemented statically, since this would require computing the intersection of two syntactic categories, which is not possible in general. Even computing such intersections for regular grammars would lead to very large automata. When using a generalized parser, filtering with reject productions can be implemented effectively during parsing. Consider the reject production Id ::= "begin" {reject}, which declares that "begin" is not a valid Id in any way (Figure 3). Thus, each and every derivation of the subsentence "begin" that produces an Id is illegal. During parsing, without the reject production, the substring "begin" would be recognized both as an Id and as a keyword in a Program. By adding the reject production to the grammar, another derivation is created for "begin" as an Id, resulting in an ambiguity of two derivations. If one derivation in an ambiguity node is rejected, the entire parse stack for that node is deleted. Hence, "begin" is not recognized as an identifier in any way. Note that the parser must wait until each ambiguous derivation has returned before it can delete a stack. (Our parser synchronizes parallel stacks on shifts, so we can wait for a shift before we delete an ambiguity node.) The stack on which this substring was recognized as an Id will not survive, thus no more actions are performed on this stack. The only derivation that remains is the one in which "begin" is a keyword in a Program. Reject productions could also be implemented as a backend filter. However, by terminating stacks on which reject productions occur as soon as possible, a dramatic reduction in the number of ambiguities can be obtained.
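The effect of the reject filter on an ambiguity node can be sketched in Haskell as follows. The tree representation is hypothetical and far simpler than SGLR's graph-structured parse stacks:

-- A derivation is an application of a named production (which may
-- carry {reject}) or an ambiguity node grouping alternatives.
data Tree = Appl { prodName :: String, isReject :: Bool, kids :: [Tree] }
          | Amb [Tree]

-- If any alternative of an ambiguity node is produced by a {reject}
-- production, the whole node (and hence its stack) is discarded.
keepAmb :: Tree -> Bool
keepAmb (Amb alts) = not (any rejectedTop alts)
  where rejectedTop (Appl _ r _) = r
        rejectedTop (Amb alts')  = any rejectedTop alts'
keepAmb _          = True

main :: IO ()
main = do
  let viaLexical = Appl "Id ::= [a-z]+" False []
      viaReject  = Appl "Id ::= \"begin\" {reject}" True []
  -- The ambiguity node for "begin" as an Id is discarded entirely, so
  -- only the keyword interpretation (on another stack) survives.
  print (keepAmb (Amb [viaLexical, viaReject]))  -- False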
4.3 Priority and Associativity
Associativity of productions and priority relations can be processed during the construction of the parse table. We present an informal description here and refer to [23] for details.
There are two phases in the parse table generation process in which associativity and priority information is used. The first is during the construction of the LR-automaton. Item-sets in the LR-automaton contain dotted productions. Prediction of new items for an item-set takes the associativity and priority relations into consideration: if a predicted production is in conflict with the production of the current item, then the predicted production is not added to the item-set. The second is when shifting a dot over a nonterminal in an item. In case of an associativity or priority conflict between a production rule in the item and a production rule on a transition, the transition is not added to the LR-automaton.

We illustrate the approach described above by discussing the construction of a part of the LR-automaton for the grammar presented in Figure 4. We create the transitions in the LR-automaton for state si, which contains the items

[Exp ::= . Exp "+" Exp]
[Exp ::= . Exp "*" Exp]
[Exp ::= . [0-9]+]
In order to shift the dot over the nonterminal Exp via the production rule Exp ::= Exp "+" Exp, every item in si is checked for a conflict. The new state sj has the item-set

[Exp ::= Exp . "+" Exp]
Note that sj does not contain the item [Exp ::= Exp . "*" Exp], since that would cause a conflict with the given priority relation "*" > "+". By pruning the transitions in a parse table in the above manner, conflicts at parse time pertaining to associativity and priority can be ruled out. However, if we want priority declarations to ignore injections (or chain rules), this implementation does not suffice. Yet it is natural to ignore injections when applying disambiguation rules, since they do not have any visible syntax. Priorities modulo chain rules require an extension of this method or a parse-time filter.
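A much-simplified Haskell sketch of the pruning in the first phase follows: when predicting items for an argument position of a production, productions of lower priority are not added. Real Sdf also takes associativity and the argument position into account; both are omitted here, and the encoding is ours:

data Prod = PlusProd | TimesProd | NumProd
  deriving (Eq, Show)

-- Priority levels for the grammar of Figure 4: "*" > "+".
priority :: Prod -> Int
priority NumProd   = 3
priority TimesProd = 2
priority PlusProd  = 1

-- Closure step: predict items for an Exp position of a parent item,
-- dropping productions whose priority is lower than the parent's.
predictItems :: Prod -> [Prod]
predictItems parent =
  [ p | p <- [PlusProd, TimesProd, NumProd], priority p >= priority parent ]

main :: IO ()
main = do
  print (predictItems TimesProd)  -- [TimesProd,NumProd]: PlusProd is pruned
  print (predictItems PlusProd)   -- all three productions are predicted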
4.4 Preference Attributes
The preference filter is a typical example of an after-parsing filter. In principle it could be applied while parsing, but this would complicate the implementation of the parser tremendously without gaining efficiency. This filter operates on an ambiguity node, which is a set of ambiguous subtrees, and selects the subtrees with the highest preference. The simplest preference filter compares the trees of each ambiguity node by comparing the avoid or prefer attributes of the top productions. Each preferred tree remains in the set, while all others are removed. If there is no preferred tree, all avoided trees are removed, while all others remain. Ignoring injections at the top is a straightforward extension to this filter. By implementing this filter in the backend of the parser we can exploit the redundancy in parse trees by caching filtered subtrees and reusing the result when filtering other identical subtrees. We use the ATerm library [5] for representing a parse forest. It has maximal sharing of subterms, limiting the amount of memory used and making subtree identification a trivial matter of pointer equality.
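The simple filter just described admits a direct functional rendering. The following Haskell sketch uses a made-up tree type: it keeps the preferred trees when any exist and otherwise drops the avoided ones (when every tree is avoided, this sketch keeps them all):

data Pref = Prefer | Neutral | Avoid
  deriving Eq

data Tree = Tree { topPref :: Pref, label :: String }

-- Filter the alternatives of one ambiguity node.
filterAmb :: [Tree] -> [Tree]
filterAmb ts
  | not (null preferred)  = preferred  -- preferred trees win
  | not (null notAvoided) = notAvoided -- otherwise drop avoided trees
  | otherwise             = ts         -- all avoided: leave the node as is
  where
    preferred  = [ t | t <- ts, topPref t == Prefer ]
    notAvoided = [ t | t <- ts, topPref t /= Avoid ]

main :: IO ()
main = print (map label (filterAmb
  [ Tree Prefer  "if 0 then (if 1 then hi else ho)"
  , Tree Neutral "if 0 then (if 1 then hi) else ho" ]))
  -- ["if 0 then (if 1 then hi else ho)"]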
For a number of grammars this simple preference filter is not powerful enough, because the production rules with the avoid or prefer attributes are not at the root (modulo injections) of the subtrees, but deeper in the subtree. In order to disambiguate these ambiguous subtrees, more subtle preference filters are needed. However, these filters will always be based on some heuristic, e.g., counting the number of “preferred” and “avoided” productions and applying some selection on the basis of these numbers, or looking at the depth at which a “preferred” or “avoided” production occurs. In principle, for any chosen heuristic, counter-examples can be constructed for which the heuristic fails to achieve its intended goal, yielding undesired results.
5 Applications
5.1 Asf+Sdf Meta-Environment
In the introduction of this paper we claimed that generalized parsing techniques are applicable in the fields of reverse engineering and language prototyping, i.e., the development of new (domain-specific) languages. The Asf+Sdf Meta-Environment [4] is used in both these fields. This environment is an interactive development environment for the automatic generation of interactive systems for manipulating programs, specifications, or other texts written in a formal language. The parser in this environment and in the generated environments is an SGLR parser. The language definitions are written in the Asf+Sdf formalism [8], which allows the definition of syntax via Sdf (Syntax Definition Formalism) [10] as well as semantics via Asf (Algebraic Specification Formalism). Figure 6 shows an Sdf specification of the previous examples. Asf+Sdf has been used in a number of industrial and scientific projects. Amongst others, it was used for parsing and compiling Asf+Sdf specifications, automatically renovating Cobol code, program analysis of legacy code via so-called island grammars [16], and development of new Action Notation syntax [9].

module Program
imports If
exports
  sorts Program
  context-free syntax
    "begin" Term "end" -> Program
    "begin" | "end"    -> Id {reject}

module If
imports Terms
exports
  context-free syntax
    "if" Nat "then" Term             -> Term {prefer}
    "if" Nat "then" Term "else" Term -> Term
    "if" | "then" | "else"           -> Id {reject}

module Terms
imports Comment
exports
  sorts Term
  lexical syntax
    [0-9]+ -> Nat
    [a-z]+ -> Id
  lexical restrictions
    Id  -/- [a-z]
    Nat -/- [0-9]
  context-free syntax
    Term Term -> Term {left}
    Id | Nat  -> Term

module Comment
exports
  lexical syntax
    [\*]                   -> Star
    ~[\*] | Star           -> CommentChar
    "(*" CommentChar* "*)" -> Comment
    [~\n] | Comment        -> LAYOUT
  lexical restrictions
    Star -/- [\)]
  context-free restrictions
    LAYOUT? -/- [~\n] | [\(].[\*]

Fig. 6. A modular Sdf definition combining some of the previous examples. This example also shows the use of a special "LAYOUT" nonterminal, the use of regular expressions (e.g. "|" for alternative and "*" for repetition) and the use of multiple start nonterminals
5.2 XT
XT [7] is a collection of basic tools for building program transformation systems, including the Stratego transformation language [24] and the syntax definition formalism Sdf supported by SGLR. The tools standardize on ATerms [5] as the common exchange format. Several meta-tools are provided for generating transformation components from syntax definitions, including a data type declaration generator that generates the data type corresponding to the abstract syntax of an Sdf syntax definition, and a pretty-printer generator that generates default pretty-print tables. To promote reuse and standardization of syntax definitions, the XT project has initiated the creation of the Online Grammar Base (http://www.program-transformation.org/gb), currently with some 25
module Program
  imports If
  exports
    sorts Program
    context-free syntax
      "begin" Term "end" -> Program
      "begin" | "end"    -> Id {reject}

module If
  imports Terms
  exports
    context-free syntax
      "if" Nat "then" Term             -> Term {prefer}
      "if" Nat "then" Term "else" Term -> Term
      "if" | "then" | "else"           -> Id {reject}

module Terms
  imports Comment
  exports
    sorts Term
    lexical syntax
      [0-9]+ -> Nat
      [a-z]+ -> Id
    lexical restrictions
      Id  -/- [a-z]
      Nat -/- [0-9]
    context-free syntax
      Term Term -> Term {left}
      Id | Nat  -> Term

module Comment
  exports
    lexical syntax
      [\*]                   -> Star
      ~[\*] | Star           -> CommentChar
      "(*" CommentChar* "*)" -> Comment
      [\ \n] | Comment       -> LAYOUT
    lexical restrictions
      Star -/- [\)]
    context-free restrictions
      LAYOUT? -/- [\ \n] | [\(].[\*]
Fig. 6. A modular Sdf definition combining some of the previous examples. This example also shows the use of a special "LAYOUT" nonterminal, the use of regular expressions (e.g., "|" for alternative and "*" for repetition), and the use of multiple start nonterminals.
Table 1. Some figures on SGLR performance (all benchmarks were performed on a 1200 MHz AMD Athlon with 512 MB of memory, running Linux)

Grammar      Average file size   Chars/second (with filter & tree)   Chars/second (w/o filter & tree)
ATerms       106,000 chars       108,000                             340,000
BibTEX       455,000 chars        85,000                             405,000
Box           80,000 chars        34,000                             368,000
Cobol        170,000 chars        58,000                             146,000
Java         105,000 chars        37,000                             210,000
Java (LR1)   105,000 chars        53,000                             242,000
Table 2. Some figures on the grammars and the generated parse tables

Grammar      Productions   States   Actions   Actions with conflicts      Gotos
ATerms               104      128      8531                       75      46569
BibTEX               150      242     40508                     3129      98901
Box                  202      385     19249                     1312     177174
Cobol               1906     5520    170375                    32634   11941923
Java                 726     1561    148359                     5303    1535446
Java (LR1)           765     1597     88561                     3354    1633156
Many syntax definitions were semi-automatically reengineered from LEX/YACC definitions using grammar manipulation tools from XT, producing more compact syntax definitions. Sdf/SGLR based parsers have been used in numerous projects built with XT in areas ranging from software renovation and grammar recovery to program optimization and compiler construction.
6 Benchmarks
We have benchmarked our implementation of SGLR by parsing a number of large files and measuring the user time. Table 1 shows the results with and without parse tree construction and backend filtering. All filters implemented in the parse table or during parsing are active in both measurements. The table shows that the parser is fast enough for industrial use. An interesting observation is that the construction of the parse tree slows down the entire process quite a bit; further speedup can be achieved by optimizing parse tree construction. Table 2 shows some details of the SLR(1) parse tables for the grammars we used. We downloaded all but the last grammar from the Online Grammar Base.
ATerms is a grammar for prefix terms with annotations, BibTEX is a bibliography file format, and Box is a mark-up language used in pretty-print tools. Cobol and Java are grammars for the well-known programming languages. We have benchmarked two different Java grammars: the first was written from scratch in Sdf, the second was obtained by transforming a Yacc grammar into Sdf. So the first is a more natural definition of Java syntax, while the second is in LR(1) form. The number of productions is measured after Sdf grammar normalization (so this number does not reflect the size of the grammar definition). We mention the number of states, gotos, and actions in the parse table. Remember that the parse table is specified down to the character level, so we have more states than usual. Also, actions and gotos are based on productions, not nonterminals, resulting in a bigger parse table. The number of actions with more than one reduce or shift (a conflict) gives an indication of the amount of “ambiguity” in a grammar. The two Java results in Table 1 show that ambiguity of a grammar has a limited effect on performance. Note that after filtering, every parse in our test set resulted in a single derivation.
7 Conclusions
In this paper we discussed the combination of generalized LR parsing with scannerless parsing. The first technique allows for the development of modular definitions of grammars, whereas the second relieves the grammar writer of interface problems between scanner and parser. The combination supports the development of declarative and maintainable syntax definitions that are not forced into the harness of a restricted grammar class such as LL(k) or LR(k). This proves to be very beneficial when developing grammars for legacy languages such as Cobol and PL/I, but it also provides greater flexibility in the development of new (domain-specific) languages. One of the assets of the SGLR approach is the separation of disambiguation from grammar structure. Thus, it is not necessary to encode disambiguation decisions using extra productions and nonterminals. Instead, a number of disambiguation filters, driven by disambiguation declarations, solve ambiguities by pruning the parse forest. Lexical ambiguities, which are traditionally handled by ad-hoc default decisions in the scanner, are also handled by such filters. Filters can be implemented at several points in time, i.e., at parser generation time, at parse time, or after parsing. SGLR is usable in practice. It has been used as the implementation of the expressive syntax definition formalism Sdf. SGLR is not only fast enough to be used in interactive tools, like the Asf+Sdf Meta-Environment, but also to parse huge amounts of Cobol code in an industrial environment. SGLR and the Sdf based parse table generator are open-source and can be downloaded from http://www.cwi.nl/projects/MetaEnv/.
Acknowledgements

User feedback has been indispensable while developing SGLR. Hayco de Jong and Pieter Olivier devoted considerable time to improving SGLR's efficiency. Merijn de Jonge and Joost Visser were instrumental in the development of the Online Grammar Base that serves as a testbed for SGLR. Jan Heering and Paul Klint provided valuable input when discussing the design and implementation of SGLR.
References

1. A. V. Aho, R. Sethi, and J. D. Ullman. Compilers. Principles, Techniques and Tools. Addison-Wesley, 1986.
2. J. Aycock and R. N. Horspool. Faster generalized LR parsing. In S. Jähnichen, editor, CC'99, volume 1575 of LNCS, pages 32–46. Springer-Verlag, 1999.
3. J. Aycock and R. N. Horspool. Schrödinger's token. Software, Practice & Experience, 31:803–814, 2001.
4. M. G. J. van den Brand, A. van Deursen, J. Heering, H. A. de Jong, M. de Jonge, T. Kuipers, P. Klint, L. Moonen, P. A. Olivier, J. Scheerder, J. J. Vinju, E. Visser, and J. Visser. The ASF+SDF Meta-Environment: a Component-Based Language Development Environment. In R. Wilhelm, editor, CC'01, volume 2027 of LNCS, pages 365–370. Springer-Verlag, 2001.
5. M. G. J. van den Brand, H. A. de Jong, P. Klint, and P. A. Olivier. Efficient Annotated Terms. Software, Practice & Experience, 30(3):259–291, 2000.
6. J. R. Cordy, C. D. Halpern-Hamu, and E. Promislow. TXL: A rapid prototyping system for programming language dialects. Computer Languages, 16(1):97–107, 1991.
7. M. de Jonge, E. Visser, and J. Visser. XT: A bundle of program transformation tools. In M. G. J. van den Brand and D. Parigot, editors, Workshop on Language Descriptions, Tools and Applications (LDTA'01), volume 44 of Electronic Notes in Theoretical Computer Science. Elsevier Science Publishers, 2001.
8. A. van Deursen, J. Heering, and P. Klint, editors. Language Prototyping, volume 5 of AMAST Series in Computing. World Scientific, 1996.
9. K.-G. Doh and P. D. Mosses. Composing programming languages by combining action-semantics modules. In M. G. J. van den Brand and D. Parigot, editors, Electronic Notes in Theoretical Computer Science, volume 44, 2001.
10. J. Heering, P. R. H. Hendriks, P. Klint, and J. Rekers. The syntax definition formalism SDF – reference manual. SIGPLAN Notices, 24(11):43–75, 1989.
11. S. C. Johnson. YACC—yet another compiler-compiler. Technical Report CS-32, AT&T Bell Laboratories, Murray Hill, N.J., 1975.
12. P. Klint and E. Visser. Using filters for the disambiguation of context-free grammars. In G. Pighizzini and P. San Pietro, editors, Proc. ASMICS Workshop on Parsing Theory, pages 1–20, Milano, Italy, 1994. Tech. Rep. 126–1994, Dipartimento di Scienze dell'Informazione, Università di Milano.
13. R. Lämmel and C. Verhoef. VS COBOL II grammar, http://www.cs.vu.nl/grammars/browsable/vs-cobol-ii/, 2001.
14. B. Lang. Deterministic techniques for efficient non-deterministic parsers. In J. Loeckx, editor, Proceedings of the Second Colloquium on Automata, Languages and Programming, volume 14 of LNCS, pages 255–269. Springer-Verlag, 1974.
15. M. E. Lesk and E. Schmidt. LEX — A lexical analyzer generator. Bell Laboratories, 1986. UNIX Programmer's Supplementary Documents, Volume 1 (PS1).
16. L. Moonen. Generating robust parsers using island grammars. In Proceedings of the 8th Working Conference on Reverse Engineering, pages 13–22. IEEE Computer Society Press, 2001.
17. J. Rekers. Parser Generation for Interactive Environments. PhD thesis, University of Amsterdam, 1992. ftp://ftp.cwi.nl/pub/gipe/reports/Rek92.ps.Z.
18. D. J. Salomon and G. V. Cormack. Scannerless NSLR(1) parsing of programming languages. SIGPLAN Notices, 24(7):170–178, 1989.
19. D. J. Salomon and G. V. Cormack. The disambiguation and scannerless parsing of complete character-level grammars for programming languages. Technical Report 95/06, Dept. of Computer Science, University of Manitoba, 1995.
20. E. Scott, A. Johnstone, and S. S. Hussain. Technical Report TR-00-12, Royal Holloway, University of London, Computer Science Dept., 2000.
21. M. Tomita. Efficient Parsing for Natural Languages. A Fast Algorithm for Practical Systems. Kluwer Academic Publishers, 1985.
22. E. Visser. Scannerless generalized-LR parsing. Technical Report P9707, Programming Research Group, University of Amsterdam, 1997.
23. E. Visser. Syntax Definition for Language Prototyping. PhD thesis, University of Amsterdam, 1997.
24. E. Visser. Stratego: A language for program transformation based on rewriting strategies. System description of Stratego 0.5. In A. Middeldorp, editor, RTA'01, volume 2051 of LNCS, pages 357–361. Springer-Verlag, 2001.
Modular Static Program Analysis

Patrick Cousot (1) and Radhia Cousot (2)

(1) École normale supérieure, Département d'informatique, 45 rue d'Ulm, 75230 Paris cedex 05, France. [email protected], www.di.ens.fr/~cousot/
(2) CNRS & École polytechnique, Laboratoire d'informatique, 91128 Palaiseau cedex, France. [email protected], lix.polytechnique.fr/~rcousot
Abstract. The purpose of this paper is to present four basic methods for compositional separate modular static analysis of programs by abstract interpretation:
– simplification-based separate analysis;
– worst-case separate analysis;
– separate analysis with (user-provided) interfaces;
– symbolic relational separate analysis;
as well as a fifth category which is essentially obtained by composition of the above separate local analyses together with global analysis methods.
1 Introduction
Static program analysis is the automatic compile-time determination of run-time properties of programs. It is used in many applications, from optimizing compilers to abstract debuggers and semantics-based program manipulation tools (such as partial evaluators, error detection tools, and program understanding tools). The problem is undecidable, so program analyzers involve safe approximations formalized by abstract interpretation of the programming language semantics. In practice, these approximations are chosen to offer the best trade-off between the precision of the information extracted from the program and the efficiency of the algorithms computing this information from the program text. Abstract interpretation based static program analysis is now in an industrialization phase, and several companies have developed static analyzers for software or hardware, either for their internal use or to provide new software analysis tools to end-users, in particular for the compile-time detection of run-time errors in embedded applications (which should be used before the application is launched). An important characteristic of these analyzers is that all possible run-time errors are considered at compilation time, without code instrumentation or user interaction (as opposed to debugging, for example).
This work was supported in part by the RTD project IST-1999-20527 daedalus of the European IST FP5 programme.
Because of foundational undecidability problems, not all errors can be statically classified as certain or impossible, and a small percentage remains as potential errors for which the analysis is inconclusive. In most commercial software, with low correctness requirements, the analysis will reveal many previously uncaught certain errors, so that the percentage of potential errors for which the analysis is inconclusive is not a practical problem, as long as all certain errors have been corrected and these corrections do not introduce new certain errors. However, for safety-critical software, it is usually not acceptable to remain inconclusive on these few remaining potential errors (their number, even if it is a low percentage of the possible errors, typically 5%, may be unacceptably large for very large programs). One solution is therefore to improve the precision of the analysis. This is always theoretically possible, but usually at the expense of the time and memory cost of the program analyses, which can become prohibitive for very large programs. The central idea is therefore that of compositional separate static analysis of program parts, where very large programs are analyzed by analyzing parts (such as components, modules, classes, functions, procedures, methods, libraries, etc.) separately and then composing the analyses of these program parts to get the required information on the whole program. Components can be analyzed with a high precision whenever they are chosen to be small enough. Since these separate analyses are done on parts and not on the whole program, total memory consumption may be reduced, even with more precise analyses of the parts. Since these separate analyses can be performed in parallel on independent computers, the global program analysis time may also be reduced.
2 Global Static Program Analysis
The formulation of global static program analysis in the abstract interpretation framework [11,16,17,21] consists in computing an approximation of a program semantics expressing the properties of interest of the program $P$ to be analyzed. The semantics can often be expressed as a least fixpoint $S[\![P]\!] = \mathrm{lfp}^{\sqsubseteq} F[\![P]\!]$, that is, as the least solution to a monotonic system of equations $X = F[\![P]\!](X)$ computed on a poset $\langle D, \sqsubseteq \rangle$, where the semantic domain $D$ is a set equipped with a partial ordering $\sqsubseteq$ with infimum $\bot$, and the endomorphism $F[\![P]\!] \in D \overset{m}{\longrightarrow} D$ is monotonic. The approximation is formalized through a Galois connection $\langle D, \sqsubseteq \rangle \underset{\alpha}{\overset{\gamma}{\leftrightarrows}} \langle \bar{D}, \bar{\sqsubseteq} \rangle$, where a concrete program property $p \in D$ is approximated by any abstract program property $\bar{p} \in \bar{D}$ such that $p \sqsubseteq \gamma(\bar{p})$, and has a best/most precise abstraction $\alpha(p) \in \bar{D}$ (other formalizations through closure operators, ideals, etc. are equivalent [11,21]; the best abstraction hypothesis can also be relaxed [24]). Then global static program analysis consists in computing an abstract least fixpoint $\bar{S}[\![P]\!] = \mathrm{lfp}^{\bar{\sqsubseteq}} \bar{F}[\![P]\!]$ which is a sound approximation of the concrete semantics in that $\mathrm{lfp}^{\sqsubseteq} F[\![P]\!] \sqsubseteq \gamma(\mathrm{lfp}^{\bar{\sqsubseteq}} \bar{F}[\![P]\!])$. This fixpoint soundness condition can be ensured by stronger local/functional soundness conditions, such as: $\bar{F}[\![P]\!]$ is monotonic and either $\alpha \circ F[\![P]\!] \mathrel{\bar{\sqsubseteq}} \bar{F}[\![P]\!] \circ \alpha$, or $\alpha \circ F[\![P]\!] \circ \gamma \mathrel{\bar{\sqsubseteq}} \bar{F}[\![P]\!]$,
or equivalently $F[\![P]\!] \circ \gamma \sqsubseteq \gamma \circ \bar{F}[\![P]\!]$ (see [21]). The least fixpoint is computed as the limit of the iterates $F^0 = \bar{\bot}, \dots, F^{n+1} = \bar{F}[\![P]\!](F^n), \dots$ where $\bar{\bot}$ is the infimum of the abstract domain $\bar{D}$. Convergence of the iterates can always be enforced using widening/narrowing techniques [17]. The result is correct but possibly less precise than the limit $\bar{S}[\![P]\!] = \bar{\bigsqcup}_{n \geq 0} F^n$, where $\bar{\bigsqcup}$ is the least upper bound (which does exist if the abstract domain $\bar{D}$ is a cpo, complete lattice, etc.) [17,24].

For example, the reachability analysis of the following program with the interval abstract domain [17]:

0: x := 1;
1: while (x < 1000) do
2:   x := (x + 1)
3: od
4:
consists in solving the following system of fixpoint equations $\langle X0, X1, X2, X3, X4 \rangle = \bar{F}[\![P]\!](X0, X1, X2, X3, X4)$ [28], where Xi is the abstract environment associated to program point $i = 0, \dots, 4$, each environment Xi maps program variables (here x) to a description of their possible values at run-time (here an interval), U is the union of abstract environments, and O denotes the singleton consisting of the undefined initial value:

X0 = init(x, O)
X1 = assign[|x, 1|](X0) U X3
X2 = assert[|x < 1000|](X1)
X3 = assign[|x, (x + 1)|](X2)
X4 = assert[|x >= 1000|](X1)
The least solution to this system of equations is then approximated iteratively using widening/narrowing iterative convergence acceleration methods [17] as follows:

X0 = { x: O }
X1 = { x: [1,1000] }
X2 = { x: [1,999] }
X3 = { x: [2,1000] }
X4 = { x: [1000,1000] }
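To make the iteration concrete, here is a self-contained sketch (ours, not any analyzer's code) that executes the equations X0..X4 above on the interval domain, with a simple widening followed by a few narrowing passes [17]; BOT encodes an unreachable environment and the operator names are our own:

    NEG, POS = float("-inf"), float("inf")
    BOT = None

    def join(i, j):
        if i is BOT: return j
        if j is BOT: return i
        return (min(i[0], j[0]), max(i[1], j[1]))

    def widen(old, new):                       # extrapolate unstable bounds
        if old is BOT: return new
        if new is BOT: return old
        return (old[0] if new[0] >= old[0] else NEG,
                old[1] if new[1] <= old[1] else POS)

    def meet_lt(i, c):                         # assert x < c
        return BOT if i is BOT or i[0] >= c else (i[0], min(i[1], c - 1))

    def meet_ge(i, c):                         # assert x >= c
        return BOT if i is BOT or i[1] < c else (max(i[0], c), i[1])

    def add1(i):                               # x := x + 1
        return BOT if i is BOT else (i[0] + 1, i[1] + 1)

    x0 = (1, 1)                                # contribution of x := 1 to X1
    x1 = x2 = x3 = BOT
    while True:                                # increasing iterates, widened
        new = widen(x1, join(x0, x3))          # X1 = X0 U X3
        if new == x1: break
        x1 = new
        x2 = meet_lt(x1, 1000)                 # X2
        x3 = add1(x2)                          # X3
    for _ in range(2):                         # narrowing: decreasing passes
        x1 = join(x0, x3)
        x2 = meet_lt(x1, 1000)
        x3 = add1(x2)
    x4 = meet_ge(x1, 1000)                     # X4
    print(x1, x2, x3, x4)  # (1,1000) (1,999) (2,1000) (1000,1000), as above

The widening first over-shoots X1 to [1, +oo); the narrowing passes then recover the precise bounds shown above.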
Some static program analysis methods (such as typing [14] or set-based analysis [27]) consist in solving constraints (also called verification conditions, etc.), but this is equivalent to iterative fixpoint computation [26], possibly with widening [27] (since the least solution to the constraints $F[\![P]\!](X) \sqsubseteq X$ is the same as the least solution $\mathrm{lfp}^{\sqsubseteq} F[\![P]\!]$ of the equations $X = F[\![P]\!](X)$). Such fixpoint computations constitute the basic steps of program analysis (e.g., for forward reachability analysis, backward ancestry analysis, etc.). More complex analyses are obtained by combining these basic fixpoint computations (see e.g. [11], [23] or [29,61] for abstract testing of temporal properties of programs).
The type of static whole-program analysis method that we have briefly described above is global in that the whole program text is needed to establish the system of equations, and this system of equations is solved iteratively at once. In practice, chaotic iteration strategies [5,18] can be used to iterate successively on components of the system of equations (as determined by a topological ordering of the dependency graph of the system of equations). However, in the worst case, the chaotic iteration may remain global, over all equations, which may be both memory consuming (since the program, hence the system of equations, can be very large) and time consuming (in particular when convergence of the iterates is slow). This problem can be addressed by using less precise analyses, but that may simply lead to analyses which are both imprecise and quite costly. Moreover, the whole program must be reanalyzed even if only a small part is modified. Hence the need for local methods for the static analysis of programs piecewise.
3 Separate Modular Static Program Analysis

3.1 Formalization of Separate Modular Static Program Analysis in the Abstract Interpretation Framework
In general, programs $P[P_1, \dots, P_n]$ are made up from parts $P_1, \dots, P_n$ such as functions, procedures, modules, classes, components, libraries, etc., so that the semantics $S[\![P]\!]$ of the whole program $P$ is obtained compositionally from the semantics of its parts $\mathrm{lfp}^{\sqsubseteq_i} F[\![P_i]\!]$, $i = 1, \dots, n$ as follows:

$S[\![P]\!] = \mathrm{lfp}^{\sqsubseteq} F[\![P]\!][\mathrm{lfp}^{\sqsubseteq_1} F[\![P_1]\!], \dots, \mathrm{lfp}^{\sqsubseteq_n} F[\![P_n]\!]]$
where $F[\![P_i]\!] \in D_i \overset{m}{\longrightarrow} D_i$, $i = 1, \dots, n$ and $F[\![P]\!] \in (D_1 \times \dots \times D_n) \overset{m}{\longrightarrow} (D \overset{m}{\longrightarrow} D)$ are (componentwise) monotonic (see e.g. [20]). The compositional separate modular static analysis of the program $P[P_1, \dots, P_n]$ is based on separate abstractions $\langle D_i, \sqsubseteq_i \rangle \underset{\alpha_i}{\overset{\gamma_i}{\leftrightarrows}} \langle \bar{D}_i, \bar{\sqsubseteq}_i \rangle$ for each part $P_i$, $i = 1, \dots, n$. The analysis of the parts then consists in computing separately an abstract information $A_i \mathrel{\bar{\sqsupseteq}_i} \mathrm{lfp}^{\bar{\sqsubseteq}_i} \bar{F}_i[\![P_i]\!]$, $i = 1, \dots, n$ on each part $P_i$, so that $\mathrm{lfp}^{\sqsubseteq_i} F[\![P_i]\!] \sqsubseteq_i \gamma_i(A_i)$. Since the components $P_i$, $i = 1, \dots, n$ are generally small, they can be analyzed with a high precision by choosing very precise abstractions $\langle D_i, \sqsubseteq_i \rangle \underset{\alpha_i}{\overset{\gamma_i}{\leftrightarrows}} \langle \bar{D}_i, \bar{\sqsubseteq}_i \rangle$ (see examples of precise abstract domains in e.g. [16]). A typical example is the replacement of numerical interval analysis (using intervals $[a, b]$ where $a$ and $b$ are numerical constants) by a more precise symbolic interval analysis (using intervals $[L, H]$ where $L$ and $H$ are mathematical variables, which can be implemented through the octagonal abstract domain of [63]). The global analysis of the program consists in composing the analyses $A_i$, $i = 1, \dots, n$ of these program parts $P_i$, $i = 1, \dots, n$ to get the required information on the whole program by computing $\mathrm{lfp}^{\bar{\sqsubseteq}} \bar{F}[\![P]\!][A_1, \dots, A_n]$.
Since these separate analyses are done on parts and not on the whole program, total memory consumption may be reduced, even with more precise analyses of the parts. Since the separate analyses of the program parts can be performed in parallel on independent computers, the global program analysis time may also be reduced. The global abstraction is composed from the abstractions of the program parts and has the form:

$\langle D[D_1, \dots, D_n], \sqsubseteq \rangle \underset{\alpha[\alpha_1, \dots, \alpha_n]}{\overset{\gamma[\gamma_1, \dots, \gamma_n]}{\leftrightarrows}} \langle \bar{D}[\bar{D}_1, \dots, \bar{D}_n], \bar{\sqsubseteq} \rangle .$

The local/functional soundness condition is:

$\alpha[\alpha_1, \dots, \alpha_n](F[\![P]\!][\gamma_1(X_1), \dots, \gamma_n(X_n)]) \mathrel{\bar{\sqsubseteq}} \bar{F}[\![P]\!][X_1, \dots, X_n]$

which implies that:

$\mathrm{lfp}^{\sqsubseteq} F[\![P]\!][\mathrm{lfp}^{\sqsubseteq_1} F[\![P_1]\!], \dots, \mathrm{lfp}^{\sqsubseteq_n} F[\![P_n]\!]] \sqsubseteq \gamma[\gamma_1, \dots, \gamma_n](\mathrm{lfp}^{\bar{\sqsubseteq}} \bar{F}[\![P]\!][A_1, \dots, A_n]) .$
3.2 Difficulty of Separate Static Program Analysis: Interference
The theoretical situation that we have sketched above in Sec. 3.1 is ideal and sometimes very difficult to put into practice. This is because the parts $P_1, \dots, P_n$ of the program $P$ are not completely independent, so that the separate analyses of the parts $P_i$ are not independent of those of the other parts $P_1, \dots, P_{i-1}, P_{i+1}, \dots, P_n$ and of that of the program $P$. For example, in an imperative program à la C, a function may call other functions in the program and use and/or modify global variables. In Pascal, a program may modify variables on the program execution stack at a program point where these variables are not even visible (see [13]). A very simple formalization consists in considering that the semantics of the program can be specified in the following equational form:

$Y = F[\![P[P_1, \dots, P_n]]\!]\langle Y, X_1, \dots, X_n \rangle$
$X_i = F[\![P_i]\!]\langle Y, X_1, \dots, X_n \rangle \qquad i = 1, \dots, n$

where $Y$ represents the global information on the program while $X_i$ represents that on the program part $P_i$, $i = 1, \dots, n$. In general, the least solution is preferred, for a componentwise ordering $\sqsubseteq \times \sqsubseteq_1 \times \dots \times \sqsubseteq_n$, where $\langle D, \sqsubseteq \rangle$ and the $\langle D_1, \sqsubseteq_1 \rangle, \dots, \langle D_n, \sqsubseteq_n \rangle$ are the concrete domains (usually cpos, complete lattices, etc.) respectively expressing the properties of the program $P[P_1, \dots, P_n]$ and its parts $P_i$, $i = 1, \dots, n$. In general, the local properties $X_i$ of the part $P_i$ depend upon the knowledge of the local properties $X_j$ of the other program parts $P_j$, $j \neq i$, and of the global properties $Y$ of the program $P[P_1, \dots, P_n]$. The properties $Y$ of the program
$P[P_1, \dots, P_n]$ are also defined in fixpoint form, so they depend on themselves as well as on the local properties $X_i$ of the parts $P_i$, $i = 1, \dots, n$. Usually, the abstraction yields an abstract system of equations of the same form:

$Y = \bar{F}[\![P[P_1, \dots, P_n]]\!]\langle Y, X_1, \dots, X_n \rangle$
$X_i = \bar{F}[\![P_i]\!]\langle Y, X_1, \dots, X_n \rangle \qquad i = 1, \dots, n$

on the abstract domains $\langle \bar{D}, \bar{\sqsubseteq} \rangle$ and $\langle \bar{D}_1, \bar{\sqsubseteq}_1 \rangle, \dots, \langle \bar{D}_n, \bar{\sqsubseteq}_n \rangle$, with Galois connections $\langle D, \sqsubseteq \rangle \underset{\alpha}{\overset{\gamma}{\leftrightarrows}} \langle \bar{D}, \bar{\sqsubseteq} \rangle$ and $\langle D_i, \sqsubseteq_i \rangle \underset{\alpha_i}{\overset{\gamma_i}{\leftrightarrows}} \langle \bar{D}_i, \bar{\sqsubseteq}_i \rangle$ for all $i = 1, \dots, n$. Ideally, the separate analysis of program part $P_i$ consists in computing a fixpoint:

$\mathrm{lfp}^{\bar{\sqsubseteq}_i} \lambda X_i \cdot \bar{F}[\![P_i]\!]\langle Y, X_1, \dots, X_i, \dots, X_n \rangle$

where the $Y, X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n$ denote the abstract properties which are assumed/guaranteed on the objects of the program $P[P_1, \dots, P_n]$ and its parts $P_j$, $j = 1, \dots, i-1, i+1, \dots, n$, which are external references within that part $P_i$ (such as global variables of a procedure, external functions called within a module, etc.). The whole problem is to determine $Y, X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n$ while analyzing program part $P_i$.
3.3 Dependence Graph
A classical technique (also used in separate compilation) consists in computing a dependence graph where a part $P_i$ depends upon another part $P_j$, $i \neq j$, if and only if the analysis of $P_i$ uses some information which is computed by the analysis of part $P_j$ (formally, $P_i$ depends upon part $P_j$ if and only if $\exists X_j, X'_j : F[\![P_i]\!]\langle Y, X_1, \dots, X_j, \dots, X_n \rangle \neq F[\![P_i]\!]\langle Y, X_1, \dots, X'_j, \dots, X_n \rangle$). It is often the case that this dependency graph is built before the analysis, and parts are analyzed in sequence in their topological order (see e.g. [8,57], and the sketch below). As in most incremental compilation systems, circular dependencies may not be considered (i.e., all circularly dependent parts are grouped into a single part, since an iterative analysis is necessary). At the limit, the analysis will degenerate into a global analysis as considered in Sec. 2, the dependence graph then corresponding to a particular chaotic iteration strategy [11,5,20]. Otherwise, the circularities must be broken using one of the compositional separate modular static program analysis methods considered in this paper:
– simplification-based separate analysis;
– worst-case separate analysis;
– separate analysis with (user-provided) interfaces;
– symbolic relational separate analysis;
or by a combined method which is essentially obtained by composition of the previous local ones together with global analysis.
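The dependence-driven scheme can be sketched as follows (our illustration, under stated assumptions: 'analyze' is a hypothetical per-part analysis that receives the results already computed for the parts it depends on, and circularly dependent parts have already been collapsed into single nodes, as described above):

    from graphlib import TopologicalSorter

    def modular_analysis(deps, analyze):
        """deps: part -> set of parts it depends on (assumed acyclic here)."""
        results = {}
        for part in TopologicalSorter(deps).static_order():
            # dependencies come first in static_order, so their results exist
            results[part] = analyze(part, {d: results[d] for d in deps[part]})
        return results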
4 Simplification-Based Separate Analysis
To start with, we consider ideas based upon the simplification of the equations to be solved. We do not consider here local simplifications of the equations (that is, simplification of one equation independently of the others, such as e.g. [39]) but global simplifications, where the simplification of one equation requires the examination of other equations. Since these systems of equations can be considered as functional programs, many program static analysis, transformation, and optimization techniques are directly applicable, such as algebraic simplification, constant propagation, partial evaluation [54], compilation, etc. For each program part, the fixpoint transformer $\bar{F}[\![P_i]\!]$ (often expressed as a system of equations $X = \bar{F}[\![P_i]\!](X)$ or equivalently as constraints $\bar{F}[\![P_i]\!](X) \mathrel{\bar{\sqsubseteq}_i} X$ [26]) is simplified into $\bar{F}_s[\![P_i]\!]$. The global analysis of the program then consists in computing $\mathrm{lfp}^{\bar{\sqsubseteq}} \bar{F}[\![P]\!][\mathrm{lfp}^{\bar{\sqsubseteq}_1} \bar{F}_s[\![P_1]\!], \dots, \mathrm{lfp}^{\bar{\sqsubseteq}_n} \bar{F}_s[\![P_n]\!]]$, so that the fixpoints for the parts are computed in a completely known context or environment. Very often, $\bar{F}_s$ is obtained by abstract interpretation of $\bar{F}$ (see [30] for a formalization of such transformations as abstract interpretations). A frequently used variant of this idea consists in first using a preliminary global analysis of the whole program $P[P_1, \dots, P_n]$ with a rough, imprecise abstraction to collect some global information on the program in order to help in the simplification of the $\bar{F}[\![P_i]\!]$, designed with a more precise abstraction, into $\bar{F}_s[\![P_i]\!]$. Examples of application of this simplification idea can be found in the analysis of procedures of [17, Sec. 4.2], in the componential set-based analysis of [38], in the variable substitution transformation of [66], and in the summary optimization of [67]. Another example is abstract compilation, where the equations and fixpoint computation are compiled (often in the same language as the one to be analyzed, so that program analysis amounts to the execution of an abstract compilation of the program), see e.g. [1,4,9,34,60]. Since the local analysis phases of the program parts $P_i$, which consist in computing the fixpoints $\mathrm{lfp}^{\bar{\sqsubseteq}_i} \bar{F}[\![P_i]\!]$, are delayed until the global analysis phase, which consists in computing $\mathrm{lfp}^{\bar{\sqsubseteq}} \bar{F}[\![P]\!][\mathrm{lfp}^{\bar{\sqsubseteq}_1} \bar{F}_s[\![P_1]\!], \dots, \mathrm{lfp}^{\bar{\sqsubseteq}_n} \bar{F}_s[\![P_n]\!]]$, not much time and memory resources are saved in this computation, even though the simplified fixpoint operators $\bar{F}_s[\![P_i]\!]$ are used in place of the original ones $\bar{F}[\![P_i]\!]$. The main reason is that the simplification often saves only a linear factor (though sometimes it can save an exponential factor, see e.g. [39]), which may be a negligible benefit when compared to the cost of the iterative fixpoint computation. In our opinion, this explains why this approach does not scale up for very large programs [36].
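A toy illustration of the idea (ours, not taken from any cited system): a part's transfer function, here three successive interval shifts, is partially evaluated into one closed-form shift before the global fixpoint computation; in general $\bar{F}_s$ would be derived from $\bar{F}$ by abstract interpretation of the equations themselves [30]:

    def f_part(x):                 # F[P_i] on intervals: x := x + 1, three times
        for _ in range(3):
            x = (x[0] + 1, x[1] + 1)
        return x

    def f_simplified(x):           # F_s[P_i]: the collapsed form, x := x + 3
        return (x[0] + 3, x[1] + 3)

    assert f_part((0, 10)) == f_simplified((0, 10)) == (3, 13)

The simplified operator is cheaper per iteration, but the iterative fixpoint computation over the parts is still performed at global analysis time, which is the limitation noted above.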
5 Worst-Case Separate Analysis
We have seen that the problem of separate analysis of a program part $P_i$ consists in determining the properties $Y, X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n$ of the external
objects referenced in the program part $P_i$ while computing the local fixpoint:

$\mathrm{lfp}^{\bar{\sqsubseteq}_i} \lambda X_i \cdot \bar{F}[\![P_i]\!]\langle Y, X_1, \dots, X_i, \dots, X_n \rangle$

The worst-case separate analysis consists in considering that absolutely no information is known about the interfaces $Y, X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n$. Traditionally in program analysis by abstract interpretation, the top symbol $\bar{\top}$ is used to represent such an absence of information ($\bar{\top}$ is the supremum of the complete lattice $\bar{D}_i$ of abstract program properties, ordered by the approximation ordering $\bar{\sqsubseteq}_i$ corresponding to the abstraction of logical implication). The worst-case separate analysis therefore consists in first separately computing or effectively approximating the local abstract fixpoints:
$A_i \mathrel{\bar{\sqsupseteq}_i} \mathrm{lfp}^{\bar{\sqsubseteq}_i} \lambda X_i \cdot \bar{F}[\![P_i]\!]\langle \bar{\top}, \bar{\top}, \dots, X_i, \dots, \bar{\top} \rangle$
for all program parts $P_i$. Then the global program analysis is:
$\mathrm{lfp}^{\bar{\sqsubseteq}} \lambda Y \cdot \bar{F}[\![P]\!][Y, A_1, \dots, A_n] .$

The main advantage of this approach is that all analyses of the parts $P_i$, $i = 1, \dots, n$ can be done in parallel. Moreover, the modification of a program part requires only the analysis of that part to be redone before the global program analysis. This explains why the worst-case separate analysis is very efficient. However, because nothing is known about the interfaces of the parts with the program and with the other parts, this worst-case analysis is often too imprecise. An example is the procedure analysis of [20, Sec. 4.2.1 & 4.2.2], where the effects of procedures (in particular the values of result/output parameters) are computed by a local analysis of the procedure assuming that the properties of the value/input parameters are unknown in the main call (and a widening is used in recursive calls, both to cope with possible non-termination of calls with identical parameters and with the possibility of having infinitely many calls with different parameters). Another example is the escape analysis of higher-order functions by [2]. Escape analysis aims at determining which local objects of a procedure do not escape out of the call (so that they can be allocated on the stack; the escaping objects have to be allocated on the heap, since their lifetime is longer than that of the procedure call). In this analysis, the higher-order functions which are passed as parameters to a procedure are assumed to be unknown, so that e.g. any call to such an unknown external higher-order function may have any possible side-effect. Yet another example is the worst-case separate analysis of library modules in the points-to and side-effect analyses of [67]. A last example is the abstract interpretation-based analysis for automatically detecting all potential interactions between the agents of a part of a mobile system interacting with an unknown context [37]. As considered in Sec. 4, an improvement consists in using a preliminary global analysis of the whole program $P[P_1, \dots, P_n]$ with a rough, imprecise abstraction
to collect some global information on the program in order to get information on the interface $Y, X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n$ more precise than the unknown $\bar{\top}$. An example is the preliminary inexpensive whole-program points-to analysis made by [68] before their modular/fragment analysis.
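The overall shape of the worst-case scheme might be sketched as follows (under stated assumptions: 'lfp_part' and 'lfp_global' are hypothetical local and global fixpoint engines, and TOP is the supremum of the abstract lattice; none of these names come from the paper):

    from concurrent.futures import ProcessPoolExecutor

    def worst_case_analysis(parts, lfp_part, lfp_global, TOP):
        # A_i over-approximates lfp (lambda X_i . F[P_i]<TOP, ..., X_i, ..., TOP>);
        # each part is analyzed independently, hence in parallel
        with ProcessPoolExecutor() as pool:
            summaries = list(pool.map(lfp_part, parts, [TOP] * len(parts)))
        # lfp (lambda Y . F[P][Y, A_1, ..., A_n])
        return lfp_global(summaries)

Changing one part invalidates only its own summary, which is what makes the scheme so cheap to re-run.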
6 Separate Analysis with (User-Provided) Interfaces
The idea of interface-based separate program analysis is to ask the user to provide information about the properties $Y, X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n$ of the external objects referenced in the program part $P_i$ while computing the local abstract fixpoints:

$\mathrm{lfp}^{\bar{\sqsubseteq}_i} \lambda X_i \cdot \bar{F}[\![P_i]\!]\langle Y, X_1, \dots, X_i, \dots, X_n \rangle, \qquad i = 1, \dots, n$

as well as the global abstract fixpoint:

$\mathrm{lfp}^{\bar{\sqsubseteq}} \lambda Y \cdot \bar{F}[\![P[P_1, \dots, P_n]]\!]\langle Y, X_1, \dots, X_n \rangle .$

The information provided on the interface of the program part with the external world takes the form of:
– the assumptions $J$ on the program and $I_1, \dots, I_{i-1}, I_{i+1}, \dots, I_n$ on the other program parts $P_j$, $j \neq i$, that can be made in the local analysis of the program part $P_i$. These assumptions will have to be guaranteed by the local analyses of the other parts and the global analysis of the program when using this part $P_i$. These assumptions make possible the analysis of the program part $P_i$ independently of the context in which that program part $P_i$ is used (or, more generally, several possible contexts may be considered);
– the guarantee $I_i$ on the program part $P_i$ that must be established by the local analysis of that part $P_i$. The global program analysis and that of the other program parts will rely upon this guarantee when using that part $P_i$ (considering only the possible behaviors of that part $P_i$ which are relevant to its context of use).
Typically, the interface should be precise enough so that the assumptions (or preconditions) $J$ on the program and $I_1, \dots, I_{i-1}, I_{i+1}, \dots, I_n$ are the weakest possible, so that the analysis of a part $P_i$ only requires the source code of that part $P_i$, while the guarantee (or postcondition) $I_i$ should be the strongest possible, so that analyses using that part $P_i$ never need to access the source code of that part $P_i$. Formally, the separate analysis with interfaces $J, I_1, \dots, I_n$ consists in computing or approximating the local abstract fixpoints:
$A_i \mathrel{\bar{\sqsupseteq}_i} \mathrm{lfp}^{\bar{\sqsubseteq}_i} \lambda X_i \cdot \bar{F}[\![P_i]\!]\langle J, I_1, \dots, X_i, \dots, I_n \rangle .$

One must also check that one can rely upon the assumptions $J, I_1, \dots, I_{i-1}, I_{i+1}, \dots, I_n$ made during the analysis of the program part $P_i$, by verifying that they are guaranteed by the analyses of the other parts $P_j$, $j \neq i$, in that:
$\forall i = 1, \dots, n : A_i \mathrel{\bar{\sqsubseteq}_i} I_i$

as well as for the global assumption $J$ on the program, which should be guaranteed by the global program analysis:

$A \mathrel{\bar{\sqsupseteq}} \mathrm{lfp}^{\bar{\sqsubseteq}} \lambda Y \cdot \bar{F}[\![P[P_1, \dots, P_n]]\!]\langle Y, I_1, \dots, I_i, \dots, I_n \rangle$

in that:

$A \mathrel{\bar{\sqsubseteq}} J .$

This technique is classical in program typing (e.g., the user-specified number, passing mode, and type of the parameters of procedures, which are assumed in the type checking of the procedure body and must be guaranteed at each procedure call) and in program verification (see e.g. the rely/guarantee specifications of [10]). Examples of user-provided interfaces in static program analysis are the control-flow analysis of [71], the notion of summary information of [48,67], and the role analysis of [55]. A particular case is when no assumption is made on the interface of each program part with its external environment, so that the automatic generation of the properties guaranteed by the program part essentially amounts to the worst-case analysis of Sec. 5 or its variants. Instead of asking the user to provide the interface, this interface can sometimes be generated automatically. For example, a backward analysis of absence of run-time errors or exceptions (such as the backward analysis using greatest fixpoints introduced in [12]) or any other ancestry analysis (e.g., to compute necessary termination conditions [12] or success conditions for logic programs [42]) can be used to automatically determine conditions on the interface which have to be assumed to ensure that the program part $P_i$ is correctly used in the whole program $P[P_1, \dots, P_n]$. A forward reachability analysis will provide information on what can be guaranteed on the interface of the program part $P_i$ with its environment, that is, the other parts $P_j$, $j \neq i$, and the program $P$. A refinement is to combine the forward and backward analyses [23,29,61]. As considered in Sec. 4 and Sec. 5, an improvement consists in using a preliminary fast global analysis of the whole program $P[P_1, \dots, P_n]$ with a rough, imprecise abstraction to collect some global information on the program in order to get information on what is guaranteed on the interfaces $J, I_1, \dots, I_n$. Moreover, simplification techniques, as considered in Sec. 4, can be applied to simplify the automatically synthesized or user-provided interface.
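The proof obligations of this section might be sketched as follows (our illustration; 'lfp_part', 'lfp_global', and 'leq', the abstract ordering, are hypothetical names, and 'parts' maps an index i to the code of P_i):

    def check_interfaces(parts, J, I, lfp_part, lfp_global, leq):
        A = {i: lfp_part(p, J, I) for i, p in parts.items()}  # local fixpoints
        local_ok = all(leq(A[i], I[i]) for i in parts)        # A_i entails I_i
        A_global = lfp_global(J, I)                           # global fixpoint
        return local_ok and leq(A_global, J), A

If any check fails, either the interfaces must be weakened/strengthened by the user, or they can be refined automatically as described above.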
7 Symbolic Relational Separate Analysis
To start with, we consider a powerful but not well-known compositional separate modular static program analysis method that we first introduced in [20]. Symbolic relational separate analysis is based on the use of relational abstract
domains and a relational semantics of the program parts (see e.g. [22,25]). The idea is to analyze a program part $P_i$ separately by giving symbolic names to all external objects used or modified in that part $P_i$. The analysis of the part consists in relating symbolically the local information within the part $P_i$ to the external objects through these names. External actions have to be handled in a lazy way and their possible effects on internal objects must be delayed ([47] is another example of lazy static program analysis, used in the context of demand-driven analysis), unless the effect of these actions is already known thanks to a previous static analysis (see Sec. 3.3). When the part is used, the information about the part is obtained by binding the external names to the actual values or objects that they denote and evaluating the delayed effects. The concrete semantics can be understood either as a relational semantics or as a program symbolic execution [11, Ch. 3.4.5], which is abstracted without losing information about the relationships between the internal and external objects of the program part, thanks to the use of a relational domain. An example is the pointer analysis using collections [19] of [20, Sec. 4.2.2]. There, pointer variables are organized in equivalence classes where variables in different classes cannot point, even indirectly, to the same position on the heap. This analysis is relational and can be started by giving names to actual parameters which are in the same class as the formal parameters (as well as their potential aliases, as specified in the assumption interface). A similar example is the interprocedural pointer analysis of [59] using parameterized points-to graphs. Another example, illustrated below, uses the polyhedral abstract domain [31], so that functions (or procedures in the case of imperative programs) can be approximated by relations. These relations can be further approximated by linear inequalities between values of variables [31]. Let us illustrate this method using a Pascal example taken from [46]:

procedure Hanoi (n : integer; var a, b, c : integer;
                 var Ta, Tb, Tc : Tower);
begin
  { n = n0 ∧ a = a0 ∧ b = b0 ∧ c = c0 }
  if n = 1 then begin
    b := b + 1; Tb[b] := Ta[a]; Ta[a] := 0; a := a − 1;
    { n = n0 = 1 ∧ a = a0 − 1 ∧ b = b0 + 1 ∧ c = c0 }
  end else begin
    { n = n0 ∧ a = a0 ∧ b = b0 ∧ c = c0 }
    Hanoi(n − 1, a, c, b, Ta, Tc, Tb);
    { n = n0 > 1 ∧ a = a0 − n + 1 ∧ b = b0 ∧ c = c0 + n − 1 }
    b := b + 1; Tb[b] := Ta[a]; Ta[a] := 0; a := a − 1;
    { n = n0 > 1 ∧ a = a0 − n ∧ b = b0 + 1 ∧ c = c0 + n − 1 }
170
    Hanoi(n − 1, c, b, a, Tc, Tb, Ta);
    { n = n0 > 1 ∧ a = a0 − n ∧ b = b0 + n ∧ c = c0 }
  end;
  { n = n0 ≥ 1 ∧ a = a0 − n0 ∧ b = b0 + n0 ∧ c = c0 }
end;
The result of analyzing this procedure, which is given above between brackets { ... }, is independent of the values of the actual parameters provided in calls. This is obtained by giving formal names $n_0$, $a_0$, $b_0$, and $c_0$ to the values of the actual parameters corresponding to the initial values of the formal parameters n, a, b, and c (the array parameters Ta, Tb, and Tc are simply ignored, which corresponds to a worst-case analysis) and by establishing a relation with the final values of these formal parameters. The result is a precise description of the effect of the procedure in the form of a relation between initial and final values of its parameters:

$\varphi(n_0, a_0, b_0, c_0, n, a, b, c) = (n = n_0 \geq 1 \wedge a = a_0 - n_0 \wedge b = b_0 + n_0 \wedge c = c_0)$

Observe that it is automatically shown that $n_0 \geq 1$, which is a necessary condition for termination. In a function call, $n_0$, $a_0$, $b_0$, and $c_0$ are set equal to the values of the actual parameters in $\varphi$ and eliminated by existential quantification. For example:

a := n; b := 0; c := 0;
{ n = a ∧ b = 0 ∧ c = 0 }
Hanoi(n, a, b, c, Ta, Tb, Tc);
{ ∃n0, a0, b0, c0 : n0 = a0 ∧ b0 = 0 ∧ c0 = 0 ∧ n = n0 ≥ 1 ∧ a = a0 − n0 ∧ b = b0 + n0 ∧ c = c0 }
This last post-condition can be simplified by projection as:

{ a = 0 ∧ n = b ≥ 1 ∧ c = 0 }
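The inferred relation can be cross-checked on concrete runs; the following is a sanity check of ours, not part of the analysis (the tower arrays are ignored, matching their worst-case treatment above):

    def hanoi(n, a, b, c):
        if n == 1:
            return a - 1, b + 1, c
        a, c, b = hanoi(n - 1, a, c, b)        # Hanoi(n-1, a, c, b, ...)
        a, b = a - 1, b + 1
        c, b, a = hanoi(n - 1, c, b, a)        # Hanoi(n-1, c, b, a, ...)
        return a, b, c

    for n0, a0, b0, c0 in [(1, 5, 0, 0), (3, 7, 2, 1), (6, 9, 0, 4)]:
        assert hanoi(n0, a0, b0, c0) == (a0 - n0, b0 + n0, c0)   # phi holds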
In recursive calls, successive approximations of the relation $\varphi$ must be used, starting from the empty one. A widening (followed by a narrowing) [17,20] can be used to ensure convergence. Such relational analyses are also very useful in the more classical context where functions are analyzed in the order of the dependence graph (see Sec. 3.3) since, as shown above, the relational analysis of the function determines a relationship between the inputs and the outputs of the function. This allows the function to be analyzed independently of its call sites, and therefore the analysis becomes “context-sensitive”, which improves the precision (and may decrease the cost, since the function/procedure may be analyzed only once, not once for each different possible context). An example of such a symbolic relational separate analysis is the notion of summary transfer function of [6,7] in the context of points-to analysis for C++. A summary transfer function for a method expresses the effects of the method
invocation on the points-to solution, parameterized by unknown symbolic initial values and conditions on these values. Another example of symbolic relational separate analysis is the strictness analysis of higher-order functions [62] using a symbolic representation of boolean higher-order functions called Typed Decision Graphs (TDGs), a refinement of Binary Decision Diagrams (BDDs). A last example is the backward escape analysis of first-order functions in [2], since the escape information for each parameter is computed as a function of the escape information for the result. For Java™, it is not a function but a relation between the various escape information available on the parameters and the result [3]. This symbolic relational separate analysis may degenerate into the simplification case of Sec. 4 if no local iteration is possible. However, this situation is rare, since it is quite uncommon that all program parts circularly depend upon one another.
8 Combination of Separate Analysis Methods
The last category of methods essentially consists in combining the previous local separate analysis methods and/or some form of global analysis. We provide a few examples below.

8.1 Preliminary Global Analysis and Simplification
We have already indicated that a preliminary rough global program analysis can always be performed to improve the information available before performing a local analysis. A classical example is pointer analysis [35,41,50,51,52,53,56,58,59,64,67,72]; see an overview in [69]. A preliminary pointer analysis is often mandatory, since making conservative assumptions regarding pointer accesses can adversely affect the precision and efficiency of the analysis of the program parts requiring this information. Such pointer alias analysis attempts to determine when two pointer expressions refer to the same storage location and is useful to detect potential side-effects through assignment and parameter passing. Also, the simplification algorithms considered in Sec. 4 are applicable in all cases.

8.2 Iterated Separate Program Static Analysis
Starting with a worst-case assumption $Y^0 = \bar{\top}$, $X_1^0 = \bar{\top}$, ..., $X_n^0 = \bar{\top}$, a separate analysis with interfaces as considered in Sec. 6 can be iterated by successively computing:

$X_i^{k+1} = \mathrm{lfp}^{\bar{\sqsubseteq}_i} \lambda X_i \cdot \bar{F}[\![P_i]\!]\langle Y^k, X_1^k, \dots, X_i, \dots, X_n^k \rangle, \qquad i = 1, \dots, n$
$Y^{k+1} = \mathrm{lfp}^{\bar{\sqsubseteq}} \lambda Y \cdot \bar{F}[\![P[P_1, \dots, P_n]]\!]\langle Y, X_1^k, \dots, X_n^k \rangle \qquad (1)$
Note that this decreasing iteration is similar to the iterative reduction idea of [15, Sec. 11.2], and is different from, and less precise than, a chaotic iteration for the global analysis (which would start with $Y^0 = \bar{\bot}$, $X_1^0 = \bar{\bot}$, ..., $X_n^0 = \bar{\bot}$). However, the advantage is that one can stop the analysis at any step $k > 0$, the successive analyses being more precise as $k$ increases (a narrowing operation [17] may have to be used in order to ensure convergence when $k \to +\infty$). A variant consists in starting with the user-provided interfaces $Y^0 = J$, $X_1^0 = I_1$, ..., $X_n^0 = I_n$. Then the validity of the final result $Y^k, X_1^k, \dots, X_n^k$ must be checked as indicated in Sec. 6. A particular case is when some program parts are missing, so that their interfaces are initially $\bar{\top}$ and are refined by a new iteration (1) as soon as they become available. Again, after each iteration $k$, the static program analysis of the partial program is correct. Yet another variant consists in successively refining the abstract domains $\langle \bar{D}, \bar{\sqsubseteq} \rangle, \langle \bar{D}_1, \bar{\sqsubseteq}_1 \rangle, \dots, \langle \bar{D}_n, \bar{\sqsubseteq}_n \rangle$ between the successive iterations $k$. The choice of this refinement can be guided by interaction with the user. Sometimes it can also be automated [43,44,45].
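The iterated scheme (1) might be sketched as follows, under the same hypothetical 'lfp_part'/'lfp_global'/TOP names as before; as noted above, the analysis may be stopped after any iteration k, and a narrowing may be needed to ensure convergence [17]:

    def iterated_analysis(parts, lfp_part, lfp_global, TOP, max_iter=10):
        Y, X = TOP, {i: TOP for i in parts}
        for _ in range(max_iter):
            X_new = {i: lfp_part(p, Y, X) for i, p in parts.items()}
            Y_new = lfp_global(X_new)
            if (X_new, Y_new) == (X, Y):       # interfaces stable: stop
                break
            X, Y = X_new, Y_new
        return Y, X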
8.3 Creating Parts through Cutpoints
Most often, the parts $P_1, \dots, P_n$ of a program $P[P_1, \dots, P_n]$ are determined on syntactic criteria (such as components, modules, classes, functions, procedures, methods, libraries, etc.). A preliminary static analysis can also be used to determine the parts on semantic grounds. For example, in Sec. 2 on global static analysis, we have considered chaotic iteration strategies [5,18] that can be used to iterate successively on components of the system of equations (as determined by a topological ordering of the dependency graph of the system of equations). Such dependences can also be refined on semantic grounds (such as definition-use chains [49]). These dependences can be used as a basis to split the whole program into parts by introducing interfaces as considered in Sec. 6. For example, with a dependence graph consisting of two mutually dependent connected components $C_1$ and $C_2$ (figure omitted), the iteration will be $((C_1)^{\ast} ; (C_2)^{\ast})^{\ast}$, where $(C_i)^{\ast}$ denotes the local iteration within the connected component $C_i$, $i = 1, 2$, ";" is the sequential composition, and the external iteration $(\dots)^{\ast}$ handles the external loop. By designing interfaces (with assumptions $A_{12}$, $A_{21}$ and guarantees $G_{12}$, $G_{21}$) at the two cutpoints between the components (figure omitted),
one can have a parallel treatment of the two components, as $((C_1)^{\ast} \parallel (C_2)^{\ast})^{\ast}$. Moreover, a preliminary dependency analysis of the variables can partition the variables into the global ones and those whose scope is restricted to one connected component only, so as to reduce the memory size needed to separately analyze the parts. If we have $G_{12} \Rightarrow A_{12}$ and $G_{21} \Rightarrow A_{21}$, then $G_{12}$ and $G_{21}$ are invariants in the sense of Floyd [40], so that no global iteration is needed. Otherwise, the external iteration can be used to strengthen the interface until a fixpoint is reached, as done in Sec. 8.2. The limit of this approach is close to classical proof methods with user-provided invariants at cutpoints of all loops [13].

8.4 Refinement of the Abstract Domain into a Symbolic Relational Domain
Separate non-relational static program analyses (such as sign analysis, interval analysis, etc.), expressing properties of individual objects of programs (such as ranges of values of numerical variables) but no relationships between objects manipulated by the program (such as the equality of the values of program variables at some program point), cannot be successfully used for the relational separate analysis considered in Sec. 7, which, in the absence of user-provided information, amounts to the worst-case separate analysis of Sec. 5. In this case, and whenever the symbolic relational separate analysis considered in Sec. 7 is not applicable, it is always possible to refine the non-relational abstract domain into a relational one for which the separate analysis method is applicable. This can be feasible in practice if the considered program parts are small enough to be analyzed at low cost using such precise abstract domains. A classical example consists in analyzing program parts (e.g., procedures) locally with the polyhedral domain of linear inequalities [31] and the global program with the much less precise abstract domain of intervals [17]. If the polyhedral domain is too expensive, the less precise domain of difference bound matrices [63] can also be used for the local relational analyses of program parts. This is essentially the technique used by [73].

8.5 Unknown Dependence Graph
Separate static program analysis is very difficult when the dependence graph is not known in a modular way (which is the case with higher-order functions in functional languages or with virtual methods in object-oriented languages). When the dependence graph is fully known and can be decomposed modularly, the symbolic relational separate analysis technique of Sec. 7 is very effective. If
the graph is not modular and parts can hardly be created through cutpoints as suggested in Sec. 8.3, or the dependence graph is partly unknown, the difficulty in the lazy symbolic representation of the unknown part of Sec. 7 arises when the effect of this unknown part must later be iterated. In the worst case, the delaying technique of Sec. 7 then amounts to a mere simplification as considered in Sec. 4. As already suggested, computational costs can then only be cut down through the worst-case separate analysis of Sec. 5 or by an over-estimation of the dependence graph (such as the 0-CFA control-flow analysis in functional languages [70] or the class hierarchy analysis in object-oriented languages [33]).
9 Conclusion
The wide range of static program analysis techniques that have been developed over the past two decades makes it possible to analyze very large programs (over 1.4 million lines of code) in a few seconds or minutes but with a very low precision [32], up to precise relational analyses which are able to analyze large programs (over 120 thousand lines of code) in a few hours or days [65], and up to very detailed and precise analyses that do not scale up for programs over a few hundred lines of code. If such static program analyses are to scale up to the precise analysis of huge programs (some of them now reaching 30 to 40 million lines), compositional separate modular methods are mandatory. In this approach, very precise analyses (in the style of Sec. 7) can be applied locally to small program parts. This local analysis phase can be fast if all these preliminary analyses are performed independently in parallel. Then a cheap global program analysis can be performed using the results of the previous analyses, maybe using less precise analyses which have a low cost. The idea can obviously be applied repeatedly, in stages, to larger and larger parts of the program with less and less refined abstract domains. Moreover, the design of specification and programming languages including user-specified interfaces of program parts can considerably facilitate such compositional separate modular static analysis of programs.
Acknowledgements

We thank Bruno Blanchet, Jérôme Feret, Charles Hymans, Francesco Logozzo, Laurent Mauborgne, Antoine Miné, and Barbara G. Ryder for their comments on a preliminary version of this paper, based on a presentation at SSGRR, Aug. 2001.
References

1. G. Amato and F. Spoto. Abstract compilation for sharing analysis. In H. Kuchen and K. Ueda (eds), Proc. FLOPS 2001 Conf., LNCS 2024, 311–325. Springer, 2001.
2. B. Blanchet. Escape analysis: Correctness proof, implementation and experimental results. In 25th POPL, 25–37, San Diego, 1998. ACM Press.
3. B. Blanchet. Escape analysis for object-oriented languages: Application to Java. In Proc. ACM SIGPLAN Conf. OOPSLA '99. ACM SIGPLAN Not. 34(10), 1999.
4. D. Boucher and M. Feeley. Abstract compilation: A new implementation paradigm for static analysis. In T. Gyimothy (ed), Proc. 6th Int. Conf. CC '96, LNCS 1060, 192–207. Springer, 1996.
5. F. Bourdoncle. Efficient chaotic iteration strategies with widenings. In D. Bjørner, M. Broy, and I.V. Pottosin (eds), Proc. FMPA, LNCS 735, 128–141. Springer, 1993.
6. R. Chatterjee, B.G. Ryder, and W. Landi. Relevant context inference. In 26th POPL, 133–146, San Antonio, 1999. ACM Press.
7. R. Chatterjee, B.G. Ryder, and W. Landi. Relevant context inference. Tech. rep. DCS-TR-360, Rutgers University, 1999. ftp://athos.rutgers.edu/pub/technical-reports/dcs-tr-360.ps.Z.
8. M. Codish, S. Debray, and R. Giacobazzi. Compositional analysis of modular logic programs. In 20th POPL, 451–464, Charleston, 1993. ACM Press.
9. M. Codish and B. Demoen. Deriving polymorphic type dependencies for logic programs using multiple incarnations of Prop. In B. Le Charlier (ed), Proc. 1st Int. Symp. SAS '94, LNCS 864, 281–296. Springer, 1994.
10. P. Colette and C.B. Jones. Enhancing the tractability of rely/guarantee specifications in the development of interfering operations. In G. Plotkin, C. Stirling, and M. Tofte (eds), Proof, Language and Interaction, ch. 10, 277–307. MIT Press, 2000.
11. P. Cousot. Méthodes itératives de construction et d'approximation de points fixes d'opérateurs monotones sur un treillis, analyse sémantique de programmes. Thèse d'État ès sciences mathématiques, Université scientifique et médicale de Grenoble, 21 Mar. 1978.
12. P. Cousot. Semantic foundations of program analysis. In S.S. Muchnick and N.D. Jones (eds), Program Flow Analysis: Theory and Applications, ch. 10, 303–342. Prentice-Hall, 1981.
13. P. Cousot. Methods and logics for proving programs. In J. van Leeuwen (ed), Formal Models and Semantics, vol. B of Handbook of Theoretical Computer Science, ch. 15, 843–993. Elsevier, 1990.
14. P. Cousot. Types as abstract interpretations, invited paper. In 24th POPL, 316–331, Paris, 1997. ACM Press.
15. P. Cousot. The calculational design of a generic abstract interpreter. In M. Broy and R. Steinbrüggen (eds), Calculational System Design, vol. 173, 421–505. NATO Science Series, Series F: Computer and Systems Sciences. IOS Press, 1999.
16. P. Cousot. Abstract interpretation based formal methods and future challenges, invited paper. In R. Wilhelm (ed), "Informatics — 10 Years Back, 10 Years Ahead", LNCS 2000, 138–156. Springer, 2000.
17. P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In 4th POPL, 238–252, Los Angeles, 1977. ACM Press.
18. P. Cousot and R. Cousot. Automatic synthesis of optimal invariant assertions: mathematical foundations. In ACM Symposium on Artificial Intelligence & Programming Languages, ACM SIGPLAN Not. 12(8):1–12, 1977.
19. P. Cousot and R. Cousot. Static determination of dynamic properties of generalized type unions. In ACM Symposium on Language Design for Reliable Software, ACM SIGPLAN Not. 12(3):77–94, 1977.
20. P. Cousot and R. Cousot. Static determination of dynamic properties of recursive procedures. In E.J. Neuhold (ed), IFIP Conf. on Formal Description of Programming Concepts, St-Andrews, CA, 237–277. North-Holland, 1977.
21. P. Cousot and R. Cousot. Systematic design of program analysis frameworks. In 6th POPL, 269–282, San Antonio, 1979. ACM Press.
22. P. Cousot and R. Cousot. Relational abstract interpretation of higher-order functional programs. Actes JTASPEFL '91, Bordeaux, FR. BIGRE, 74:33–36, 1991.
23. P. Cousot and R. Cousot. Abstract interpretation and application to logic programs. J. Logic Programming, 13(2–3):103–179, 1992. (The editor of J. Logic Programming has mistakenly published the unreadable galley proof. For a correct version of this paper, see http://www.di.ens.fr/~cousot.)
24. P. Cousot and R. Cousot. Abstract interpretation frameworks. J. Logic and Comp., 2(4):511–547, Aug. 1992.
25. P. Cousot and R. Cousot. Galois connection based abstract interpretations for strictness analysis, invited paper. In D. Bjørner, M. Broy, and I.V. Pottosin (eds), Proc. FMPA, LNCS 735, 98–127. Springer, 1993.
26. P. Cousot and R. Cousot. Compositional and inductive semantic definitions in fixpoint, equational, constraint, closure-condition, rule-based and game-theoretic form, invited paper. In P. Wolper (ed), Proc. 7th Int. Conf. CAV '95, LNCS 939, 293–308. Springer, 1995.
27. P. Cousot and R. Cousot. Formal language, grammar and set-constraint-based program analysis by abstract interpretation. In Proc. 7th FPCA, 170–181, La Jolla, 1995. ACM Press.
28. P. Cousot and R. Cousot. Introduction to abstract interpretation. Course notes for the "NATO Int. Summer School 1998 on Calculational System Design", Marktoberdorff, 1998.
29. P. Cousot and R. Cousot. Abstract interpretation based program testing, invited paper. In Proc. SSGRR 2000 Computer & eBusiness International Conference, compact disk paper 248 and electronic proceedings http://www.ssgrr.it/en/ssgrr2000/proceedings.htm, 2000. Scuola Superiore G. Reiss Romoli.
30. P. Cousot and R. Cousot. Systematic Design of Program Transformation Frameworks by Abstract Interpretation. In 29th POPL, 178–190, Portland, 2002. ACM Press.
31. P. Cousot and N. Halbwachs. Automatic discovery of linear restraints among variables of a program. In 5th POPL, 84–97, Tucson, 1978. ACM Press.
32. M. Das, B. Liblit, M. Fähndrich, and J. Rehof. Estimating the impact of scalable pointer analysis on optimization. In P. Cousot (ed), Proc. 8th Int. Symp. SAS '01, LNCS 2126, 259–277. Springer, 2001.
33. J. Dean, D. Grove, and G. Chambers. Optimization of object-oriented programs using static class hierarchy analysis. In W.G. Olthoff (ed), Proc. 9th Euro. Conf. ECOOP '95, LNCS 952, 77–101. Springer, 1995.
34. S.K. Debray and D.S. Warren. Automatic mode inference for logic programs. J. Logic Programming, 5(3):207–229, 1988.
Modular Static Program Analysis
177
35. M. Emami, R. Ghiya, and L. J. Hendren. Context-sensitive interprocedural pointsto analysis in the presence of function pointers. In Proc. ACM SIGPLAN ’93 Conf. PLDI. ACM SIGPLAN Not. 28(6), 242–256, 1994. ACM Press. 171 36. M. Felleisen. Program analyses: A consumer’s perspective and experiences, invited talk. In J. Palsberg (ed), Proc. 7th Int. Symp. SAS ’2000, LNCS 1824. Springer, 2000. Presentation available at URL http://www.cs.rice.edu:80/~matthias/Presentations/SAS.ppt. 165 37. J. Feret. Confidentiality analysis of mobile systems. In J. Palsberg (ed), Proc. 7th Int. Symp. SAS ’2000, LNCS 1824, 135–154. Springer, 2000. 166 38. C. Flanagan and M. Felleisen. Componential set-based analysis. TOPLAS, 21(2):370–416, Feb. 1999. 165 39. C. Flanagan and J.B. Saxe. Avoiding exponential explosion: generating compact verification conditions. In 28th POPL, 193–205, London, Jan. 2001. ACM Press. 165 40. R.W. Floyd. Assigning meaning to programs. In J.T. Schwartz (ed), Proc. Symposium in Applied Mathematics, vol. 19, 19–32. AMS, 1967. 173 41. R. Ghiya and L.J. Hendren. Putting pointer analysis to work. In 25th POPL, 121–133, San Diego, Jan. 1998. ACM Press. 171 42. R. Giacobazzi. Abductive analysis of modular logic programs. In M. Bruynooghe (ed), Proc. Int. Symp. ILPS ’1994, Ithaca, 377–391. MIT Press, 1994. 168 43. R. Giacobazzi and E. Quintarelli. Incompleteness, counterexamples and refinements in abstract model-checking. In P. Cousot (ed), Proc. 8th Int. Symp. SAS ’01, LNCS 2126, 356–373. Springer, 2001. 172 44. R. Giacobazzi and F. Ranzato. Refining and compressing abstract domains. In P. Degano, R. Gorrieri, and A. Marchetti-Spaccamela, editors, Proc. 24th Int. Coll. ICALP ’97, LNCS 1256, 771–781. Springer, 1997. 172 45. R. Giacobazzi, F. Ranzato, and F. Scozzari. Making abstract interpretations complete. J. ACM, 47(2):361–416, 2000. 172 46. N. Halbwachs. D´etermination automatique de relations lin´eaires v´erifi´ees par les variables d’un programme. Th`ese de 3`eme cycle d’informatique, Universit´e scientifique et m´edicale de Grenoble, Grenoble, 12 Mar. 1979. 169 47. C. Hankin and D. Le M´etayer. Lazy type inference and program analysis. Sci. Comput. Programming, 25(2–3):219–249, 1995. 169 48. M.J. Harrold, D. Liang, and S. Sinha. An approach to analyzing and testing component-based systems. In Proc. 1st Int. ICSE Workshop on Testing Distributed Component-Based Systems. Los Angeles, 1999. 168 49. M.J. Harrold and M.L. Soffa. Efficient computation of interprocedural definitionuse chains. TOPLAS, 16(2):175–204, Mar. 1994. 172 50. M. Hind, M. Burke, P. Carini, and J.-D. Choi. Interprocedural pointer alias analysis. TOPLAS, 21(4):848–894, Jul. 1999. 171 51. M. Hind and A. Pioli. Assessing the effects of flow-sensitivity on pointer alias analyses. In G. Levi (ed), Proc. 5th Int. Symp. SAS ’98, LNCS 1503, 57–81. Springer, 1998. 171 52. S. Horwitz. Precise flow-insensitive may-alias analysis is NP-hard. TOPLAS, 19(1):1–6, Jan. 1997. 171 53. S. Jagannathan, P. Thiemann, S. Weeks, and A.K. Wright. Single and loving it: Must-alias analysis for higher-order languages. In 25th POPL, 329–341, San Diego, Jan. 1998. ACM Press. 171 54. N. Jones, C.K. Gomard, and P. Sestoft. Partial Evaluation and Automatic Program Generation. Int. Series in Computer Science. Prentice-Hall, June 1993. 165
178
Patrick Cousot and Radhia Cousot
55. V. Kuncak, P. Lam, and M. Rinard. Role analysis. In 29th POPL, 17–32, Portland, Jan. 2002. ACM Press. 168 56. W.A. Landi. Undecidability of static analysis. ACM Lett. Prog. Lang. Syst., 1(4):323–337, Dec. 1992. 171 57. O. Lee and K. Yi. A proof method for the correctness of modularized kCFAs. Technical Memorandum ROPAS-2000-9, Research On Program Analysis System, Korea Advanced Institute of Science and Technology, Nov. 2000. http://ropas.kaist.ac.kr/~cookcu/paper/tr2000b.ps.gz. 164 58. D. Liang and M.J. Harrold. Efficient points-to analysis for whole-program analysis. In O. Nierstrasz and M. Lemoine (eds), Software Engineering - ESEC/FSE’99, 7th European Software Engineering Conference, LNCS 1687, 199–215, 1999. 171 59. D. Liang and M.J. Harrold. Efficient computation of parameterized pointer information for interprocedural analyses. In P. Cousot (ed), Proc. 8th Int. Symp. SAS ’01, LNCS 2126, 279–298. Springer, 2001. 169, 171 60. F. Mal´esieux, O. Ridoux, and P. Boizumault. Abstract compilation of LambdaProlog. In J. Jaffar (ed), JICSLP ’98, Manchester, 130–144. MIT Press, 1992. 165 61. D. Mass´e. Combining forward and backward analyzes of temporal properties. In O. Danvy and A. Filinski (eds), Proc. 2nd Symp. PADO ’2001, LNCS 2053, 155–172. Springer, 2001. 161, 168 62. L. Mauborgne. Abstract interpretation using typed decision graphs. Sci. Comput. Programming, 31(1):91–112, May 1998. 171 63. A. Min´e. A new numerical abstract domain based on difference-bound matrices. In O. Danvy and A. Filinski (eds), Proc. 2nd Symp. PADO ’2001, LNCS 2053, 155–172. Springer, 2001. 162, 173 64. G. Ramalingam. The undecidability of aliasing. TOPLAS, 16(5):1467–1471, Sep. 1994. 171 65. F. Randimbivololona, J. Souyris, and A. Deutsch. Improving avionics software verification cost-effectiveness: Abstract interpretation based technology contribution. In Proceedings DASIA 2000 – DAta Systems In Aerospace, Montreal. ESA Publications, May 2000. 174 66. A. Rountev and S. Chandra. Off-line variable substitution for scaling points-to analysis. In Proc. ACM SIGPLAN ’00 Conf. PLDI. ACM SIGPLAN Not. 35(5), 47–56, Vancouver, June 2000. 165 67. A. Rountev and B. Ryder. Points-to and side-effect analyses for programs built with precompiled libraries. In R. Wilhelm (ed), Proc. 10th Int. Conf. CC ’2001, LNCS 2027, 20–36. Springer, 2001. 165, 166, 168, 171 68. A. Rountev, B.G. Ryder, and W. Landi. Data-flow analysis of program fragments. In O. Nierstrasz and M. Lemoine (eds), Software Engineering - ESEC/FSE’99, 7th European Software Engineering Conference, LNCS 1687, 235–252. Springer, 1999. 167 69. B.G. Ryder, W. Landi, P.A. Stocks, S. Zhang, and R. Altucher. A schema for interprocedural side effect analysis with pointer aliasing. TOPLAS, 2002. To appear. 171 70. O. Shivers. The semantics of scheme control-flow analysis. In P. Hudak and N.D. Jones (eds), Proc. PEPM ’91, ACM SIGPLAN Not. 26(9), 190–198. ACM Press, Sep. 1991. 174 71. Y.M. Tang and P. Jouvelot. Separate abstract interpretation for control-flow analysis. In M. Hagiya and J.C. Mitchell (eds), Proc. Int. Conf. TACS ’95, LNCS 789, 224–243. Springer, 1994. 168
Modular Static Program Analysis
179
72. A. Venet. Automatic analysis of pointer aliasing for untyped programs. Sci. Comput. Programming, Special Issue on SAS’96, 35(1):223–248, Sep. 1999. 171 73. Z. Xu, T. Reps, and B.P. Miller. Typestate checking of machine code. In D. Sands (ed), Proc. 10th ESOP ’2001, LNCS 2028, 335–351. Springer, 2001. 173
StreamIt: A Language for Streaming Applications William Thies, Michal Karczmarek, and Saman Amarasinghe Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 {thies,karczma,saman}@lcs.mit.edu
Abstract. We characterize high-performance streaming applications as a new and distinct domain of programs that is becoming increasingly important. The StreamIt language provides novel high-level representations to improve programmer productivity and program robustness within the streaming domain. At the same time, the StreamIt compiler aims to improve the performance of streaming applications via stream-specific analyses and optimizations. In this paper, we motivate, describe and justify the language features of StreamIt, which include: a structured model of streams, a messaging system for control, a re-initialization mechanism, and a natural textual syntax.
1 Introduction
Applications that are structured around some notion of a “stream” are becoming increasingly important and widespread. There is evidence that streaming media applications are already consuming most of the cycles on consumer machines [1], and their use is continuing to grow. In the embedded domain, applications for hand-held computers, cell phones, and DSPs are centered around a stream of voice or video data. The stream abstraction is also fundamental to high-performance applications such as intelligent software routers, cell phone base stations, and HDTV editing consoles. Despite the prevalence of these applications, there is surprisingly little language and compiler support for practical, large-scale stream programming. Of course, the notion of a stream as a programming abstraction has been around for decades [2], and a number of special-purpose stream languages have been designed (see [3] for a review). Many of these languages and representations are elegant and theoretically sound, but they often lack features and are too inflexible to support straightforward development of modern stream applications, or their implementations are too inefficient to use in practice. Consequently, most programmers turn to general-purpose languages such as C or C++ to implement stream programs. There are two reasons that general-purpose languages are inadequate for stream programming. Firstly, they are a mismatch for the application domain.
For more information about StreamIt, see http://compiler.lcs.mit.edu/streamit.
That is, they do not provide a natural or intuitive representation of streams, thereby having a negative effect on readability, robustness, and programmer productivity. Moreover, because the widespread parallelism and regular communication patterns of data streams are left implicit in general-purpose languages, compilers are not stream-conscious and do not perform stream-specific optimizations. As a result, performance-critical loops are often hand-coded in a low-level assembly language and must be re-implemented for each target architecture. This practice is labor-intensive, error-prone, and very costly. Secondly, general-purpose languages are a mismatch for the emerging class of grid-based architectures [4,5,6] that are especially well-suited for stream processing. Perhaps the primary appeal of C is that it provides a “common machine language” for von-Neumann architectures. That is, it abstracts away the idiosyncratic differences between machines, but encapsulates their common properties: a single program counter, arithmetic operations, and a monolithic memory. However, for grid-based architectures, the von-Neumann model no longer holds, as there are multiple instruction streams and distributed memory banks. Thus, C no longer serves as a common machine language–in fact, it provides the wrong abstraction for the underlying hardware, and architecture-specific directives are often needed to obtain reasonable performance. Again, this greatly complicates the job of the programmer and hampers portability. StreamIt is a language and compiler specifically designed for modern stream programming. The StreamIt language has two goals: first, to provide high-level stream abstractions that improve programmer productivity and program robustness within the streaming domain, and second, to serve as a common machine language for grid-based processors. At the same time, the StreamIt compiler aims to perform stream-specific optimizations to achieve the performance of an expert programmer. This paper motivates, describes, and justifies the high-level language features of StreamIt, version 1.0. The major limitation of StreamIt 1.0 is that all flow rates in the streams must be static; applications such as compression that have dynamically varying flow rates will be the subject of future work. A large set of applications can be implemented with static rates, and while dynamic rates will require a different runtime model, it will still be essential to fully analyse and optimize static sub-sections in order to obtain high performance. The paper is organized as follows. In Section 2, we characterize the domain of streaming programs that motivates the design of StreamIt, and in Section 3 we describe the language features in detail. We present an in-depth example of a software radio in Section 4, preliminary results in Section 5, related work in Section 6, and conclusions in Section 7.
2 Streaming Application Domain
The applications that make use of a stream abstraction are diverse, with targets ranging from embedded devices, to consumer desktops, to high-performance servers. Examples include systems such as the Click modular router [7] and the
SpectrumWare software radio [8,9]; specifications such as the Bluetooth communications protocol [10], the GSM Vocoder [11], and the AMPS cellular base station [12]; and almost any application developed with Microsoft's DirectShow library [13], RealNetworks' RealSDK [14] or Lincoln Lab's Polymorphous Computing Architecture [15].

[Fig. 1. A block diagram of our frequency-hopping software radio: a pipeline of ReadFromAtoD, RFtoIF, Booster, FFT, CheckFreqHop, CheckQuality, and AudioBackEnd stages.]

We have identified a number of properties that are common to such applications–enough so as to characterize them as belonging to a distinct class of programs, which we will refer to as streaming applications. We believe that the salient characteristics of a streaming application are as follows:

1. Large streams of data. Perhaps the most fundamental aspect of a streaming application is that it operates on a large (or virtually infinite) sequence of data items, hereafter referred to as a data stream. Data streams generally enter the program from some external source, and each data item is processed for a limited time before being discarded. This is in contrast to scientific codes, which manipulate a fixed input set with a large degree of data reuse.

2. Independent stream filters. Conceptually, a streaming computation represents a sequence of transformations on the data streams in the program. We will refer to the basic unit of this transformation as a filter: an operation that–on each execution step–reads one or more items from an input stream, performs some computation, and writes one or more items to an output stream. Filters are generally independent and self-contained, without references to global variables or other filters. A stream program is the composition of filters into a stream graph, in which the outputs of some filters are connected to the inputs of others.

3. A stable computation pattern. The structure of the stream graph is generally constant during the steady-state operation of a stream program. That is, a certain set of filters are repeatedly applied in a regular, predictable order to produce an output stream that is a given function of the input stream.

4. Occasional modification of stream structure. Even though each arrangement of filters is executed for a long time, there are still dynamic modifications to the stream graph that occur on occasion. For instance, if a wireless network interface is experiencing high noise on an input channel, it might react by adding some filters to clean up the signal; a software radio re-initializes a portion of the stream graph when a user switches from AM to FM. Sometimes, these re-initializations are synchronized with some data in the stream–for instance, when a network protocol changes from Bluetooth to 802.11 at a certain point of a transmission.
There is typically an enumerable number of configurations that the stream graph can adopt in any one program, such that all of the possible arrangements of filters are known at compile time.

5. Occasional out-of-stream communication. In addition to the high-volume data streams passing from one filter to another, filters also communicate small amounts of control information on an infrequent and irregular basis. Examples include changing the volume on a cell phone, printing an error message to a screen, or changing a coefficient in an upstream FIR filter.

6. High performance expectations. Often there are real-time constraints that must be satisfied by streaming applications; thus, efficiency (in terms of both latency and throughput) is of primary concern. Additionally, many embedded applications are intended for mobile environments where power consumption, memory requirements, and code size are also important.

    class FIRFilter extends Filter {
      float[] weights;
      int N;

      void init(float[] weights) {
        this.weights = weights;
        this.N = weights.length;
        setInput(Float.TYPE); setOutput(Float.TYPE);
        setPush(1); setPop(1); setPeek(N);
      }

      void work() {
        float sum = 0;
        for (int i=0; i<N; i++)
          sum += input.peek(i)*weights[i];
        input.pop();
        output.push(sum);
      }
    }

    class Main extends Pipeline {
      void init() {
        add(new DataSource());
        add(new FIRFilter(N));
        add(new Display());
      }
    }

Fig. 2. An FIR filter in StreamIt

[Fig. 3. Stream structures supported by StreamIt: (a) a Pipeline, a sequence of streams; (b) a SplitJoin, in which a splitter distributes the input to parallel streams and a joiner merges their outputs; (c) a FeedbackLoop, with a joiner at the top of the cycle and a splitter at the bottom.]
3 Language Overview
StreamIt includes stream-specific abstractions and representations that are designed to improve programmer productivity for the domain of programs described above. In this paper, we present StreamIt in legal Java syntax.¹ Using Java has many advantages, including programmer familiarity, availability of compiler frameworks and a robust language specification. However, the resulting syntax can be cumbersome, and in the future we plan to develop a cleaner and more abstract syntax that is designed specifically for stream programs.

¹ However, for the sake of brevity, the code fragments in this paper are sometimes lacking modifiers or methods that would be needed to make them strictly legal Java.
3.1 Filters
The basic unit of computation in StreamIt is the Filter. An example of a Filter from our software radio (see Figure 1) is the FIRFilter, shown in Figure 2. The central aspect of a filter is the work function, which describes the filter's most fine-grained execution step in the steady state. Within the work function, a filter can communicate with neighboring blocks using the input and output channels, which are FIFO queues declared as fields in the Filter base class. These high-volume channels support three intuitive operations: 1) pop() removes an item from the end of the channel and returns its value, 2) peek(i) returns the value of the item i spaces from the end of the channel without removing it, and 3) push(x) writes x to the front of the channel. The argument x is passed by value; if it is an object, a separate copy is enqueued on the channel.

A major restriction of StreamIt 1.0 is that it requires filters to have static input and output rates. That is, the number of items peeked, popped, and pushed by each filter must be constant from one invocation of the work function to the next. In fact, as described below, the input and output rates must be declared in the filter's init function. If a filter violates the declared rates, StreamIt throws a runtime error and the subsequent behavior of the program is undefined. We plan to support dynamically changing rates in a future version of StreamIt.

Each Filter also contains an init function, which is called at initialization time. The init function serves two purposes. Firstly, it is for the user to establish the initial state of the filter. For example, the FIRFilter records weights, the coefficients that it should use for filtering. A filter can also push, pop, and peek items from within the init function if it needs to set up some initial state on its channels, although this usually is not necessary. A user should instantiate a filter by using its constructor, and the init function will be called implicitly with the same arguments that were passed to the constructor.²

² This design might seem unnatural, but it is necessary to allow inlining (Section 3.2) and re-initialization (Section 3.4) within a Java-based syntax.

The second purpose of the init function is to specify the filter's I/O types and data rates to the StreamIt compiler. The types are specified with calls to setInput and setOutput, while the rates are specified with calls to setPush, setPop, and setPeek. The setPeek call can be omitted if the peek count is the same as the pop count.

Rationale. StreamIt's representation of a filter is an improvement over general-purpose languages. In a procedural language, the analog of a filter is a block of statements in a complicated loop nest (see Figure 4). This representation is unnatural for expressing the feedback and parallelism that is inherent in streaming systems. Also, there is no clear abstraction barrier between one filter and another, and high-volume stream processing is muddled with global variables and control flow. The loop nest must be re-arranged if the input or output ratios of a filter change, and scheduling optimizations further inhibit the readability of the code.
184
William Thies et al.
int N = 5; int BLOCK_SIZE = 100;
class FIRFilter { int N; float[] input;
void step(float[] input, float[] output, int numIn, int numOut) { float sum = 0; for (int k=0; k
FIRFilter(int N) { this.N = N; } float[] getData(float[] output, int offset, int length) { if (input==null) { input = new float[MAX_LENGTH]; source.getData(input, 0, N+length); } else { source.getData(input, N, length); }
void main() { float input[] = new float[N]; float output[] = new float[BLOCK_SIZE]; int numIn, numOut;
for (int i=0; i
for (numIn=0; numIn
for (int i=0; i
int wholeSteps = (BLOCK_SIZE-numOut)/N; for (int k=0; k<wholeSteps; k++) for (numIn=0; numIn
} }
for (numIn=0; numOut
Fig. 4. An optimized FIR filter in a procedural language. A complicated loop nest is required to avoid mod functions and to use memory efficiently, and the structure of the loops depends on the data rates (e.g., BLOCK SIZE) within the stream. An actual implementation might inline the calls to step
void main() { DataSource datasource = new DataSource(); FIRFilter filter = new FIRFilter(5); Display display = new Display(); filter.source = datasource; display.source = filter; display.run(); }
Fig. 5. An FIR filter in an object oriented language. A “pull model” is used by each filter object to retrieve a chunk of data from its source, and straight-line code connects one filter to another
In contrast, StreamIt places the filter in its own independent unit, making explicit the parallelism and inter-filter communication while hiding the grungy details of scheduling and optimization from the programmer.

One could also use an object-oriented language to implement a stream abstraction (see Figure 5). This avoids some of the problems associated with a procedural loop nest, but the programming model is again complicated by efficiency concerns. That is, a runtime library usually executes filters according to a pull model, where a filter operates on a block of data that it retrieves from the input channel. The block size is often optimized for the cache size of a given architecture, which hampers portability. Moreover, operating on large-grained blocks obscures the fundamental fine-grained algorithm that is visible in a StreamIt filter. Thus, the absence of a runtime model in favor of automated scheduling and optimization again distinguishes StreamIt.
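To make the channel operations concrete, consider the following minimal filter sketch. It is our own illustration in the syntax just described (the Averager name and its use are not part of the paper's radio application); it averages a window of K items, producing one output per input:

    class Averager extends Filter {
      int K;
      void init(int K) {
        setInput(Float.TYPE); setOutput(Float.TYPE);
        // declared static rates: one item pushed and one popped per
        // invocation of work, with K items inspected via peek
        setPush(1); setPop(1); setPeek(K);
        this.K = K;
      }
      void work() {
        float sum = 0;
        for (int i=0; i<K; i++)
          sum += input.peek(i);   // inspect without consuming
        input.pop();              // consume exactly one item
        output.push(sum/K);       // produce exactly one item
      }
    }

The declared rates form the static I/O contract against which the compiler schedules the filter; violating them at run time triggers the error behavior described above.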
    class Delay extends Filter {
      void init(int delay) {
        setInput(Float.TYPE); setOutput(Float.TYPE);
        setPush(1); setPop(1);
        for (int i=0; i<delay; i++)
          output.push(0);
      }
      void work() {
        output.push(input.pop());
      }
    }

    class EchoEffect extends SplitJoin {
      void init() {
        setSplitter(Duplicate());
        add(new Delay(100));
        add(new Delay(0));
        setJoiner(RoundRobin());
      }
    }

    class AudioEcho extends Pipeline {
      void init() {
        add(new AudioSource());
        add(new EchoEffect());
        add(new Adder());    // Adder is defined in Figure 8
        add(new Speaker());
      }
    }

Fig. 6. An echo effect in StreamIt

    class Fibonacci extends FeedbackLoop {
      void init() {
        setDelay(2);
        setJoiner(RoundRobin(0,1));
        setBody(new Filter() {
          void init() {
            setInput(Integer.TYPE); setOutput(Integer.TYPE);
            setPush(1); setPop(1); setPeek(2);
          }
          void work() {
            output.push(input.peek(0)+input.peek(1));
            input.pop();
          }
        });
        setSplitter(Duplicate());
      }
      int initPath(int index) {
        return index;
      }
    }

Fig. 7. A FeedbackLoop version of Fibonacci

3.2 Connecting Filters
StreamIt provides three constructs for composing filters into a communicating network: Pipeline, SplitJoin, and FeedbackLoop (see Figure 3). Each structure specifies a pre-defined way of connecting filters into a single-input, single-output block, which we will henceforth refer to as a “stream”. That is, a stream is any instance of a Filter, Pipeline, SplitJoin, or FeedbackLoop. Every StreamIt program is a hierarchical composition of these stream structures. The Pipeline construct is for building a sequence of streams. Like a Filter, a Pipeline has an init function that is called upon its instantiation. Within init, component streams are added to the Pipeline via successive calls to add. For example, in the AudioEcho in Figure 6, the init function adds four streams to the Pipeline: an AudioSource, an EchoEffect, an Adder, and a Speaker. This sequence of statements automatically connects these four streams in the order specified. Thus, there is no work function in a Pipeline, as the component streams fully specify the behavior. The channel types and data rates are also implicit from the connections. Each of the stream constructs can either be executed on its own, or embedded in an enclosing stream structure. The AudioEcho can execute independently, since the first component consumes no items and the last component produces no items. However, the EchoEffect must be used as a component, since the first stream inputs items and the last stream outputs items. When a stream is embedded in another construct, the first and last components of the stream are implicitly connected to the stream’s neighbors in the parent construct. The SplitJoin construct is used to specify independent parallel streams that diverge from a common splitter and merge into a common joiner. As in a Pipeline, the components of a SplitJoin are specified with successive calls to add from the
init function. For example, the EchoEffect in Figure 6 adds two streams that run in parallel, each of which is a Delay filter. The splitter specifies how items from the input of the SplitJoin are distributed to the parallel components. For simplicity, we allow only compiler-defined splitters, of which there are three types: 1) Duplicate, which replicates each data item and sends a copy to each parallel stream, 2) RoundRobin(i1, i2, ..., ik), which sends the first i1 data items to the stream that was added first, the next i2 data items to the stream that was added second, and so on, and 3) Null, which means that none of the parallel components require any input, and there are no input items to split. If the weights are omitted from a RoundRobin, then they are assumed to be equal to one for each stream. Note that RoundRobin can function as an exclusive selector if one or more of the weights are zero.

Likewise, the joiner is used to indicate how the outputs of the parallel streams should be interleaved on the output channel of the SplitJoin. There are two kinds of joiners: 1) RoundRobin, whose function is analogous to a RoundRobin splitter, and 2) Null, which means that none of the parallel components produce any output, and there are no output items to join. The splitter and joiner types are specified with calls to setSplitter and setJoiner, respectively. The EchoEffect uses a Duplicate splitter so that each item appears both directly and as an echo; it uses a RoundRobin joiner to interleave the immediate signals with the delayed ones. In AudioEcho, an Adder is used to combine each pair of interleaved signals.

The FeedbackLoop construct provides a way to create cycles in the stream graph. The Fibonacci stream in Figure 7 illustrates the use of this construct. Each FeedbackLoop contains: 1) a body stream, which is the block around which a backwards "feedback path" is being created, 2) a loop stream, which can perform some computation along the feedback path, 3) a splitter, which distributes data between the feedback path and the output channel at the bottom of the loop, and 4) a joiner, which merges items between the feedback path and the input channel at the top of the loop. These components are specified from within the init function via calls to setBody, setLoop, setSplitter, and setJoiner, respectively. The splitters and joiners can be any of those for SplitJoin, except for Null. The call to setLoop can be omitted if no computation is performed along the feedback path.

The FeedbackLoop has special semantics when the stream is first starting to run. Since there are no items on the feedback path at first, the stream instead inputs items from an initPath function defined by the FeedbackLoop; given an index i, initPath provides the i'th initial input for the feedback joiner. With a call to setDelay from within the init function, the user can specify how many items should be calculated with initPath before the joiner looks for data items from the feedback channel.

Evident in the Fibonacci example of Figure 7 is another feature of the StreamIt syntax: inlining. The definition of any stream can be inlined at the point of its instantiation, thereby preventing the definition of many small classes that are used only once, and, moreover, providing a syntax that reveals the
hierarchical structure of the streams from the indentation level of the code. In our Java syntax, we make use of anonymous classes for inlining [16].

Rationale. StreamIt differs from other languages in that it imposes a well-defined structure on the streams; all stream graphs are built out of a hierarchical composition of Pipelines, SplitJoins, and FeedbackLoops. This is in contrast to other environments, which generally regard a stream as a flat and arbitrary network of filters that are connected by channels. However, arbitrary graphs are very hard for the compiler to analyze, and equally difficult for a programmer to describe. Most programmers either resort to straight-line code that links one filter to another (thereby making it very hard to visualize the stream graph), or use an ad-hoc graphical programming environment that admits no good textual representation. In contrast, StreamIt is a clean textual representation that–especially with inlined streams–makes it very easy to see the shape of the computation from the indentation level of the code. The comparison of StreamIt's structure with arbitrary stream graphs could be likened to the difference between structured control flow and GOTO statements. Though sometimes the structure restricts the expressiveness of the programmer, the gains in robustness, readability, and compiler analysis are immense. Though graphical programming languages have not gained large-scale acceptance, a graphical editor for StreamIt would have advantages since every stream graph has a precise textual equivalent that could also be edited by the programmer. Further, the hierarchical structure of the stream graph could simplify visualization.

At first glance, the statements within a StreamIt init function might appear more like a verbose API than a novel language. However, it was actually a careful design decision to specify all "stream configuration information" via function calls from within the init functions. While the current syntax is somewhat tedious, there is great flexibility in this approach, since the user can intermix configuration directives with statements that calculate the configuration parameters. This allows for fully parameterized graph construction–the FFT stream in Figure 8 inputs a parameter N and adjusts the number of butterfly stages appropriately. This further improves the modularity and readability of the code.
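As a concrete sketch of such parameterized construction (our own example with illustrative names, not code from the paper), the following SplitJoin computes its own width from an argument to init:

    class Amplifier extends Filter {
      float gain;
      void init(float gain) {
        setInput(Float.TYPE); setOutput(Float.TYPE);
        setPush(1); setPop(1);
        this.gain = gain;
      }
      void work() {
        output.push(gain * input.pop());
      }
    }

    class AmplifierBank extends SplitJoin {
      void init(float[] gains) {
        setSplitter(Duplicate());        // every branch sees every item
        for (int i=0; i<gains.length; i++)
          add(new Amplifier(gains[i]));  // graph width depends on the argument
        setJoiner(RoundRobin());         // interleave one output per branch
      }
    }

Because the add calls execute inside init, ordinary Java control flow determines the shape of the graph, while the resulting structure is still fully known at compile time.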
Messages
StreamIt provides a dynamic messaging system for passing irregular, low-volume control information between filters and streams. Messages are sent from within the body of a filter’s work function, perhaps to change a parameter in another filter. For example, in our software radio code (see Figure 8), the CheckFreqHop stage sends a message upstream to change the frequency of the receiver if it detects that the transmitter is about to change frequencies. The sender can continue to execute while the message is en route, and the setFreq method will be invoked in the receiver with argument FREQ[k] when the message arrives. Since message delivery is asynchronous, there can be no return value; only void methods can be message targets.
188
William Thies et al.
class RFtoIF extends Filter { int size, count, N; float weight[]; void init(int N, float freq) { setInput(Float.TYPE); setOutput(Float.TYPE); setPush(1); setPop(1); this.N = N; setFreq(freq); } void work() { output.push(input.pop()*weight[count++]); if (count==size) count = 0; } Frequency-Hop void setFreq(float freq) { count = 0; Message size = CARRIER_FREQ/freq*N; weight = new float[size]; for (int i=0; i<size; i++) weight[i] = Math.sin(i*PI/size); } } class CheckFreqHop extends SplitJoin { RFtoIFPortal freqPortal; void init(RFtoIFPortal freqPortal) { this.freqPortal = freqPortal; setSplitter(RoundRobin(N/4-2,1,1, N/2,1,1,N/4-2)); int k = 0; for (int i=1; i<=5; i++) { if ((i==2)||(i==4)) { for (int j=0; j<2; j++) { add(new Filter() { void init() { setInput(Float.TYPE); setOutput(Float.TYPE); setPush(1); setPop(1); } void work() { float val = input.pop(); if (val >= MIN_THRESHOLD) freqPortal.setFreq( FREQ[k], new Latency(4*N,6*N)); output.push(val); }}); k++; } } else add(new Identity()); } setJoiner(RoundRobin(N/4-2,1,1, N/2,1,1,N/4-2)); } }
class CheckQuality extends Filter { float aveHi, aveLo; BoosterPortal boosterPortal; boolean boosterOn; void init(BoosterPortal bp, boolean on) { setInput(Float.TYPE); setOutput(Float.TYPE); setPush(1); setPop(1); aveHi = 0; aveLo = 1; this.boosterPortal = bp; this.boosterOn = on; } void work() { float val = input.pop(); aveHi = max(0.9*aveHi, val); aveLo = min(1.1*aveLo, val); if (aveHi - aveLo < FAIL_QUAL && !booosterOn) { boosterPortal.init(true, BEST_EFFORT); boosterOn = true; } if (aveHi - aveLo > PASS_QUAL && boosterOn) { boosterPortal.init(false, BEST_EFFORT); boosterOn = false; } Booster output.push(val); Re-Initialization } }
Message
class Booster extends Pipeline { void init(int N, boolean enabled) { if (enabled) add(new FIRFilter(BOOST_WEIGHTS)); } } class TrunkedRadio extends Pipeline { int N = 64; BoosterPortal boosterPortal = new BoosterPortal(); RFtoIFPortal freqPortal = new RFtoIFPortal(); void init() { ReadFromAtoD in = add(new ReadFromAtoD()); RFtoIF rf2if = add(new RFtoIF(N, STARTFREQ)); Booster booster = add(new Booster(N, false)); add(new FFT(N)); add(new CheckFreqHop(freqHop)); add(new CheckQuality(onOff, false)); AudioBackEnd out = add(new AudioBackEnd()); freqPortal.register(rf2if); boosterPortal.register(booster); MAX_LATENCY(in, out, 10); } }
class Butterfly extends Pipeline { void init(int N, int W) { add(new SplitJoin() { void init() { setSplitter(RoundRobin(N, N)); add(new Filter() { float weight[] = new float[W]; int curr; void init() { See setInput(Float.TYPE); setOutput(Float.TYPE); Fig. 9 for setPush(1); setPop(1); Diagram for (int i=0; i<W; i++) weight[i] = calcWeight(i, N, W); curr = 0; } void work() { output.push(input.pop()*weight[curr++]); if (curr>=W) curr = 0; }}); add(new Identity()); setJoiner(RoundRobin()); class Adder extends Filter { }}); void init() { add(new SplitJoin() { setInput(Float.TYPE); setOutput(Float.TYPE); void init() { setPush(1); setPop(2); setSplitter(Duplicate()); } add(new Subtractor()); void work() { add(new Adder()); output.push(input.pop() + input.pop()); setJoiner(RoundRobin(N, N)); } }}); } } class FFT extends Pipeline { void init(int N) { add(new SplitJoin() { void init() { setSplitter(RoundRobin(N/2, N/2)); for (int i=0; i<2; i++) add(new SplitJoin() { void init() { setSplitter(RoundRobin()); add(new Identity()); add(new Identity()); setJoiner(RoundRobin(N/4, N/4)); }}); setJoiner(RoundRobin()); }}); for (int i=2; i
Fig. 8. StreamIt code for a software radio. Arrows denote the paths of messages
Message timing. The central aspect of the messaging system is a sophisticated timing mechanism that allows filters to specify when a message will be received relative to the flow of information between the sender and the receiver. Recall that each filter executes independently, without any notion of global time. Thus, the only way for two filters to talk about a time that is meaningful for both of them is in terms of the data items that are passed through the streams from one to the other. In StreamIt, one can specify a range of latencies for each message delivery. This latency is measured in terms of an information "wavefront" from one filter to another. For example, in the CheckFreqHop example of Figure 8, the sender indicates an interval of latencies between 4N and 6N. This means that the receiver will receive the message immediately following the last invocation of its own work function which produces an item affecting the output of the sender's 4N'th to 6N'th work functions, counting the sender's current work function as number 0. Due to space limitations, we cannot define this notion precisely in this paper (see [17,18] for a formal semantics), but the general idea is simple: the receiver is invoked when it sees the information wavefront that the sender sees in 4N to 6N execution steps.

In some cases, the ability to synchronize the arrival of a message with some element of the data stream is very important. For example, CheckFreqHop knows that the transmitter will change the frequency between 4N and 6N steps later, in terms of the frame that CheckFreqHop is inputting. To ensure that the radio changes frequencies at the same time–so as not to lose any data at the old or new frequency–CheckFreqHop instructs the receiver to switch frequencies when the receiver sees one of the last data items at the old frequency.

Portals for broadcast messaging. StreamIt also has support for modular broadcast messaging. When a sender wants to send a message that will invoke method M of the receiver R upon arrival, it does not call M on the object R. Rather, it calls M on a Portal of which R is a member. Portals are typed containers that forward all messages they receive to the elements of the container. Portals could be useful in cases when a component of a filter library needs to announce a message (e.g., that it is shutting down) but does not know the list of recipients; the user of the library can pass to the filter a Portal containing all interested receivers. As for message delivery constraints, the user specifies a single time interval for each message, and that interval is interpreted separately (as described above) for each receiver in the Portal. In a language with generic data types, a Portal could be implemented as a templated list. However, since Java does not yet support templates, we automatically generate an <X>Portal class for every class and interface <X>. Our syntax for using Portals is evident in the TrunkedRadio class in Figure 8.

Rationale. Stream programs present a challenge in that filters need both regular, high-volume data transfer and irregular, low-volume control communication. Moreover, there is the problem of reasoning about the relative "time" between filters when they are running asynchronously and in parallel.
A different approach to messaging is to embed control messages in the data stream instead of providing a separate mechanism for dynamic message passing. This does have the effect of associating the message time with a data item, but it is complicated, error-prone, and leads to unreadable code. Further, it could hurt performance in the steady state (if each filter has to check whether each item is actual data or a control message) and complicates compiler analysis as well. Finally, one can't send messages upstream without creating a separate data channel for them to travel in.

Another solution is to treat messages as synchronous method calls. However, this delays the progress of the stream when the message is en route, thereby degrading the performance of the program and restricting the compiler's freedom to reorder filter executions.

We feel that the StreamIt messaging model is an advance in that it separates the notions of low-volume and high-volume data transfer–both for the programmer and the compiler–without losing a well-defined semantics where messages are timed relative to the high-volume data flow. Further, by separating message communication into its own category, fewer connections are needed for steady-state data transfer and the resulting stream graphs are more amenable to structured stream programming.
3.4 Re-initialization
One of the characteristics of a streaming application is the need to occasionally modify the structure of part of the stream graph. StreamIt allows these changes through a re-initialization mechanism that is integrated with its messaging model. If a sender targets a message at the init function of a stream or filter S, then when the message arrives, it re-executes the initialization code and replaces S with a new version of itself. However, the new version might have a different structure than the original if the arguments to the init call on re-initialization were different than during the original initialization.

When an init message arrives, it does not kill all of the data that is in the stream being re-initialized. Rather, it drains the stream until the wavefront of information (as defined for the messaging model) from the top of the stream has reached the bottom. The draining occurs without consuming any data from the input channels to the re-initialized region. Instead, a drain function of each filter is invoked to provide input when its other input source is frozen. (Each filter can override the drain function as part of its definition.) If the programmer prefers to kill the data in a stream segment instead of draining it, this can be indicated by sending an extra argument to the message portal with the re-initialization message.

Rationale. Re-initialization is a headache for stream programmers because–if done manually–the entire runtime system could be put on hold to re-initialize a portion of the stream. The interface to starting and stopping streams could be complicated when there is not an explicit notion of initialization time vs.
StreamIt: A Language for Streaming Applications
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
11 10 9 8 3 2 1 0
10 8 2 0 round robin
11 9 3 1
weighted round robin (2, 2)
11 9 10 8 3 1 2 0
weighted round robin (4, 4)
round robin
15 14 13 12 7 6 5 4
14 12 6 4 round robin
15 13 7 5
weighted round robin (2, 2)
15 13 14 12 7 5 6 4
191
15 11 13 9 14 10 12 8 7 3 5 1 6 2 4 0
Fig. 9. The bit-reversal phase in the FFT, with N=8. A bit-reversal permutation is one that swaps all elements with indices whose binary representations are the reverse of each other. The Butterfly stage is similar, but ommitted for lack of space steady-state execution time, and ad-hoc draining techniques could risk losing data or deadlocking the system. StreamIt improves on this situation by abstracting the re-initialization process from the user. That is, no auxillary control program is needed to drain the old streams and create the new structure; the user need only trigger the reinitialization process through a message. Additionally, any hierarchical stream construct automatically becomes a possible candidate for re-initialization, due to the well-defined stream structure and the simple interface with the init function. Finally, it is easy for the compiler to recognize stream re-initialization possibilities and to account for all possible configurations of the stream flow graph during analysis and optimization. 3.5
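A small sketch of this re-initialization interface follows; the Denoiser, its smooth helper, and denoisePortal are our own illustrative names, patterned on the Booster toggle of Figure 8 rather than taken from the paper:

    class Denoiser extends Filter {
      void init(boolean aggressive) {
        setInput(Float.TYPE); setOutput(Float.TYPE);
        setPush(1); setPop(1);
        // configure filter state according to the flag ...
      }
      void work() {
        output.push(smooth(input.pop()));  // smooth: hypothetical helper
      }
      void drain() {
        // invoked while the enclosing region drains: supply neutral
        // items instead of consuming frozen upstream input
        output.push(0);
      }
    }

A monitoring filter elsewhere would then rebuild the stream with new arguments by targeting its init function through a portal, e.g. denoisePortal.init(true, BEST_EFFORT), just as CheckQuality toggles the Booster in Figure 8.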
3.5 Latency Constraints
Lastly, StreamIt provides a simple way of restricting the latency of an information wavefront in traveling from the input of one filter to the output of a downstream filter. Issuing the directive MAX_LATENCY(A, B, n) from within an init means that A can only execute up to the wavefront of information that B will see after n invocations of its own work function.
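For instance, in a sketch of our own (Source and Sink are hypothetical endpoints; the call mirrors MAX_LATENCY(in, out, 10) in the TrunkedRadio class of Figure 8):

    class BoundedPath extends Pipeline {
      void init() {
        Source src = add(new Source());
        Sink snk = add(new Sink());
        // src may execute only up to the information wavefront that
        // snk will see within 8 invocations of its own work function
        MAX_LATENCY(src, snk, 8);
      }
    }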
4 Detailed Example
We now discuss the StreamIt implementation of the Trunked Radio illustrated in Figure 1. The Trunked Radio is a frequency-hopping system in which the receiver switches between a set of known frequencies whenever it hears certain tones from the transmitter. The top-level class, TrunkedRadio, is implemented as a seven-stage Pipeline (see Figure 8). The RFtoIF stage modulates the input signal from RF to a
frequency band around the current IF frequency. To support a change in the IF frequency when frequency hopping occurs, the RFtoIF filter contains a setFreq method that is invoked via a message from the CheckFreqHop stage. The message is sent from CheckFreqHop with a latency range of 4N to 6N, which means that RFtoIF must deliver between 4N and 6N items using the old modulation scheme before changing to the new frequency.

The optional Booster stage provides amplification for weak signals, but is usually turned off to conserve power. The Booster is toggled by a re-initialization message from the CheckQuality stage, which estimates the signal quality by the shape of the frequency spectrum. If all the frequencies have similar amplitudes, CheckQuality assumes that the signal-to-noise ratio is low and sends a message to activate the Booster. This message is sent using best-effort delivery.

The FFT stage converts the signal from the time domain to the frequency domain; please refer to p. 796 of [19] for a diagram of the parallel FFT algorithm. The StreamIt implementation consists of a bit-reversal permutation followed by a series of Butterfly stages. The bit-reversal phase illustrates how data can be reshuffled with just a few SplitJoin constructs (see Figure 9). The Butterfly stage–which is parameterized to allow for a compact representation of the FFT–also employs SplitJoins to select groups of items for its computation. We believe that the StreamIt version of the FFT is clean and intuitive, as the SplitJoin constructs expose the natural parallelism of the algorithm.
5 Results
We have implemented a fully-functional prototype of the StreamIt compiler as an extension to the Kopi Java Compiler, a component of the open-source Kopi Project [20]. At this time, our compiler is a proof-of-concept and does not yet include the stream-specific optimizations that we are working on; we generate C code that is compiled with a StreamIt runtime library to produce the final executable. We have also developed a library in Java that allows StreamIt code to be executed as pure Java, thereby providing a verification mechanism for the output of the compiler.

The compilation process for streaming programs contains many novel aspects because the basic unit of computation is a stream rather than a procedure. In order to compile stream modules separately, we have developed a runtime interface–analogous to that of a procedure call for traditional codes–that specifies how one can interact with a black box of streaming computation. The stream interface contains separate phases for initialization and steady-state execution; in the execution phase, the interface includes a contract for input items, output items, and possible message production and consumption.

Though we have yet to add optimizations to our compiler, it is nonetheless interesting to evaluate its baseline performance. For this purpose, we developed StreamIt implementations of four applications: 1) A GSM Decoder, which takes GSM-encoded parameters as inputs, and uses these to synthesize audible speech [11], 2) A system from the Polymorphous Computing Architecture
(PCA) [15] which encapsulates the core functionality of modern radar, sonar, and communications signal processors, 3) A software-based FM Radio with equalizer, and 4) A performance test from the SpectrumWare system that implements an Orthogonal Frequency Division Multiplexor (OFDM) [8].

Table 1 gives characteristics of the above applications including the number of filters implemented and the size of the stream graph as coded. Table 2 gives the performance of our compiler by comparing the StreamIt implementation against either the SpectrumWare implementation or (in the case of GSM) a hand-optimized C version. SpectrumWare [8] is a high-performance runtime library for streaming programs, implemented in C++. The StreamIt language offers a higher level of abstraction than SpectrumWare (see Section 3.1), and yet the StreamIt compiler is able to beat the SpectrumWare performance by a factor of two for the PCA Demo and FM Radio. For the GSM application, the extensively hand-optimized C version incorporates many transformations that rely on a high-level knowledge of the algorithm, and StreamIt performs an order of magnitude slower. However, this version of the compiler is only a prototype, and is not yet intended to compete with hand-coded C. Our code generation strategy currently has many inefficiencies, and in the future we plan to generate optimized assembly code by interfacing with a code generator. We believe that stream-conscious optimizations can improve the performance by an order of magnitude on uniprocessors; moreover, we have yet to consider parallel targets, and this is where we expect to find the most pronounced benefits of the abundant parallelism and regular communication patterns exposed by StreamIt.

Table 1. Application Characteristics

      Benchmark     Lines  Filters  Graph Size
      PCA Demo        484        5           7
      FM Radio        411        5          27
      perftest4       347        5          20
      GSM Decoder    3050       11          21

Table 2. Performance Results (in µsec/item)

      Benchmark     StreamIt  SpectrumWare    C
      PCA Demo           1.3           3.4  N/A
      FM Radio           4.9           9.9  N/A
      perftest4          330           330  N/A
      GSM Decoder       4.88           N/A  .47
6 Related Work
A large number of programming languages have included a concept of a stream; see [3] for a survey. Those that are perhaps most related to StreamIt 1.0 are synchronous dataflow languages such as LUSTRE [21] and ESTEREL [22] which require a fixed number of inputs to arrive simultaneously before firing a stream node. However, most special-purpose stream languages do not contain features such as messaging and support for modular program development that are essential for modern stream applications. Also, most of these languages are so abstract and unstructured that the compiler cannot perform enough analysis and optimization to result in an efficient implementation.
At an abstract level, the stream graphs of StreamIt share a number of properties with the synchronous dataflow (SDF) domain as considered by the Ptolemy project [23]. Each node in an SDF graph produces and consumes a given number of items, and there can be delays along the arcs between nodes (corresponding loosely to items that are peeked in StreamIt). As in StreamIt, SDF graphs are guaranteed to have a static schedule, and there are a number of nice scheduling results incorporating code size and execution time [24]. However, previous results on SDF scheduling do not consider constraints imposed by point-to-point messages, and do not include a notion of StreamIt's information wavefronts, re-initialization, and programming language support.

A specification package used in industry bearing some likeness to StreamIt is SDL: Specification and Description Language [25]. SDL is a formal, object-oriented language for describing the structure and behavior of large, real-time systems, especially for telecommunications applications. It includes a notion of asynchronous messaging based on queues at the receiver, but does not incorporate wavefront semantics as does StreamIt. Moreover, its focus is on specification and verification whereas StreamIt aims to produce an efficient implementation.
7 Conclusions and Future Work
This paper presents StreamIt, a novel language for high-performance streaming applications. Stream programs are emerging as a very important class of applications with distinct properties from other recognized application classes. This paper develops fundamental programming constructs for the streaming domain. The primary goal of StreamIt is to raise the abstraction level in stream programming without sacrificing performance. We have argued that StreamIt’s mechanisms for filter definition, filter composition, messaging, and re-initialization will improve programmer productivity and program robustness within the streaming domain. Also, we believe that StreamIt is a viable common machine language for grid-based architectures (e.g., [4,5,6]), just as C is a common machine language for von-Neumann machines. StreamIt abstracts away the target’s granularity, memory layout, and network interconnect, while capturing the notion of independent processors that communicate in regular patterns. We are developing fission and fusion algorithms that can automatically adjust the granularity of a stream graph to match that of a given target. We have a number of extensions planned for the next version of the StreamIt language. The current version is designed primarily for uniform one-dimensional data processing, but constructs for hierarchical frames of data would be useful for image processing. Moreover, a future version will support dynamically varying I/O rates of the filters in the stream. We expect that such support will require new language constructs–for instance, a type-dispatch splitter that routes items to the components of a SplitJoin based on their type, and a fall-through joiner that pulls items from any stream in a SplitJoin as soon as they are produced.
Our immediate focus is on developing a high-performance optimizing compiler for StreamIt 1.0. As described in [18], the structure of StreamIt can be exploited by the compiler to perform a wide range of stream-specific optimizations. Our goal is to match the performance of hand-coded applications, such that the abstraction benefits of StreamIt come with no performance penalty.
Acknowledgements

The StreamIt compiler was implemented with Michael Gordon and David Maze, with applications support from Jeremy Wong, Henry Hoffman, and Matthew Brown; we also thank Matt Frank for many helpful comments. This work was supported in part by the MIT Oxygen Project and DARPA Grant DBT6396-C0036.
References

1. Rixner, S., et al: A Bandwidth-Efficient Architecture for Media Processing. In: HPCA, Dallas, TX (1998) 179
2. Abelson, H., Sussman, G.: Structure and Interpretation of Computer Programs. MIT Press, Cambridge, MA (1985) 179
3. Stephens, R.: A Survey of Stream Processing. Acta Informatica 34 (1997) 491–541 179, 193
4. Mai, K., Paaske, T., Jayasena, N., Ho, R., Dally, W., Horowitz, M.: Smart memories: A modular reconfigurable architecture (2000) 180, 194
5. Waingold, E., et al.: Baring it all to Software: The Raw Machine. MIT-LCS Technical Report 709, Cambridge, MA (1997) 180, 194
6. Sankaralingam, K., Nagarajan, R., Keckler, S., Burger, D.: A Technology-Scalable Architecture for Fast Clocks and High ILP. UT Austin Tech Report 01-02 (2001) 180, 194
7. Kohler, E., Morris, R., Chen, B., Jannotti, J., Kaashoek, M. F.: The Click modular router. ACM Trans. on Computer Systems 18 (2000) 263–297 180
8. Tennenhouse, D., Bose, V.: The SpectrumWare Approach to Wireless Signal Processing. Wireless Networks (1999) 181, 193
9. Bose, V., Ismert, M., Welborn, M., Guttag, J.: Virtual radios. IEEE/JSAC, Special Issue on Software Radios (April 1999) 181
10. Bluetooth Consortium: Bluetooth Specification, Volume 1 (July 1999) 181
11. Mouly, M., Pautet, M.: The GSM System for Mobile Communications. Cell&Sys, Palaiseau, France (1992) 181, 192
12. EIA/TIA: Mobile station–land station compatibility spec. Tech. Rep. 553 (1989) 181
13. Microsoft Corporation: Microsoft DirectShow. Online Documentation (2001) 181
14. RealNetworks: Software Developer's Kit. Online Documentation (2001) 181
15. Lebak, J.: Polymorphous Computing Architecture (PCA) Example Applications and Description. External Report, MIT Lincoln Laboratory (August 2001) 181, 193
16. Gosling, Joy, Steele: The Java Language Specification. Addison Wesley (1997) 187
17. Thies, B., Karczmarek, M., Amarasinghe, S.: StreamIt: A Language for Streaming Applications. MIT-LCS Technical Memo TM-620, Cambridge, MA (August 2001) 189
18. Thies, W., Karczmarek, M., Gordon, M., Maze, D., Wong, J., Hoffmann, H., Brown, M., Amarasinghe, S.: StreamIt: A Compiler for Streaming Applications. MIT-LCS Technical Memo TM-622, Cambridge, MA (December 2001) 189, 195
19. Cormen, T. H., Leiserson, C. E., Rivest, R. L.: Introduction to Algorithms. The MIT Electrical Engineering and Computer Science Series. MIT Press/McGraw Hill (1990) 192
20. Gay-Para, V., Graf, T., A. G. L., Wais, E.: Kopi Reference Manual. http://www.dms.at/kopi/docs/kopi.html (2001) 192
21. Halbwachs, N., Caspi, P., Raymond, P., Pilaud, D.: The synchronous data-flow programming language LUSTRE. Proceedings of the IEEE 79 (1991) 1305–1320 193
22. Berry, G., Gonthier, G.: The Esterel Synchronous Programming Language: Design, Semantics, Implementation. Science of Computer Programming 19 (1992) 87–152 193
23. Lee, E. A.: Overview of the Ptolemy Project. UCB/ERL Technical Memorandum UCB/ERL M01/11, Dept. EECS, University of California, Berkeley, CA (2001) 194
24. Bhattacharyya, S. S., Murthy, P. K., Lee, E. A.: Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers (1996) 189 pages. 194
25. CCITT Recommendation Z.100: Specification and Description Language. ITU, Geneva (1992) 194
Compiling Mercury to High-Level C Code

Fergus Henderson and Zoltan Somogyi

Department of Computer Science and Software Engineering
The University of Melbourne, Victoria 3010, Australia
{fjh,zs}@cs.mu.oz.au
Abstract. Many logic programming implementations compile to C, but they compile to very low-level C, and thus discard many of the advantages of compiling to a high-level language. We describe an alternative approach to compiling logic programs to C, based on continuation passing, that we have used in a new back-end for the Mercury compiler. The new approach compiles to much higher-level C code, which means the compiler back-end and run-time system can be considerably simpler. We present a formal schema for the transformation, and give benchmark results which show that this approach delivers performance that is more than competitive with the fastest previous implementation, with greater simplicity and better portability and interoperability. The approach we describe can also be used for compiling to other target languages, such as IL (the Microsoft .NET intermediate language).

Keywords: compilation techniques, programming language implementation, logic programming, Mercury, C, GNU C.
1 Introduction
Nowadays many implementations of high-level languages compile to C [6,9,15]. We have used the technique ourselves in the original implementation [7,14,10] of Mercury, a strongly-typed declarative programming language which supports functional and logic programming. The popularity of compilation to C is not surprising, because its benefits are by now well known:

– C code can be portable. C compilers exist for almost every important hardware architecture. In comparison to generating assembler or writing a JIT compiler, generating portable C code greatly reduces the effort required to port the high-level language implementation to a different hardware architecture.

– C is efficient. High quality C compilers are widely and often freely available. Generating native code via C can result in considerably better performance than writing an interpreter, and the performance will usually be close to what could be obtained by generating assembler directly, especially if there is a close match between the source language and C. Indeed the performance may well be
better in practice, since more resources are available for improving the C compiler's optimizer than would be available for a less established language.

– C has critical mass. There is good tool support, lots of programmers know C, there is lots of existing C code to link to, and so on.

– C is higher level than assembler. This can make compiling to C much easier. Again, the benefit is greatest if there is a close match between the source language and C.

Unfortunately, however, logic programming (LP) languages are not a good match with C. There are two key problems: tail recursion and backtracking.

Tail Recursion. In logic programs, recursion is the primary method of iteration. To ensure that recursive loops can operate in constant space, logic programming implementations perform tail call optimization, not just for directly recursive tail calls, but also for tail calls to other procedures (which might be part of an indirectly recursive loop). However, C programs generally use explicit looping constructs for iteration, and C implementations generally don't optimize tail calls. Even those implementations which do make some attempt at this generally only do so in a very limited set of circumstances. The problem is that the semantics of C make it very difficult for the C compiler to perform tail call optimization if any local variable has its address taken. Furthermore, for most C programs, the payoff of optimizing the difficult cases is likely to be very small.

When compiling to C, an LP language compiler can recognize directly tail-recursive loops and output C looping constructs for them. But if procedure calls in the LP language are to be mapped to function calls in C, then tail calls other than directly recursive tail calls can't be optimized so easily. Inlining can reduce some indirectly recursive loops to directly recursive loops, but it won't handle the general case; indirect tail recursion can even span module boundaries.

Backtracking. The presence of nondeterminism and backtracking in LP languages leads to a completely different model of procedure calling. In C and other traditional languages, each procedure is called and then, after some processing, the procedure will normally return to the caller. The stack frame can be allocated on calls and deallocated on returns. In contrast, Prolog and other languages that support backtracking use a four-port model (CALL, EXIT, REDO, FAIL) [4]. A procedure is called (CALL), does some processing, and then returns an answer to the caller (EXIT); but after an answer has been returned, the caller can re-enter the procedure to look for more solutions (REDO). If there are more solutions, the procedure will return another answer (EXIT), and the caller may again ask for more solutions (REDO). Eventually, when there are no more solutions, the procedure will FAIL; only then can the stack frame be deallocated.
The Traditional Solution. Because of the difference in procedure calling model imposed by backtracking, LP predicate calls and exits cannot be mapped directly to C function calls and returns. Instead, LP language compilers that target C use their own data areas for parameter passing and storing local variables. These data areas are then manipulated explicitly in the generated C code. This solution also helps solve the problem with tail calls, because the LP language compiler has complete control over the LP data areas. The C stack can be kept to a fixed size using a driver loop, e.g.

    typedef void *Func(void);

    void driver(Func *entry)
    {
        register Func *fp = entry;
        while (fp != NULL) {
            fp = (Func *) (*fp)();
        }
    }

with each C function returning the address of the next C function to call. Various optimizations on this basic model (such as loop unrolling, using GNU C extensions, inline assembler jumps, etc.) are possible, and were exploited by earlier versions of the Mercury compiler [10].

Drawbacks of the Traditional Solution. The traditional approach means that the LP language implementation will effectively define its own virtual machine, with the virtual machine instructions typically implemented as C macros. This approach is workable, but because it doesn't use the C calling convention, it unfortunately discards many of the advantages of compiling to a high-level language:

– The LP language compiler needs to do much of the work of a traditional compiler, including allocating variables to virtual machine stack slots or virtual registers.

– The generated code is low level, and hard to read.

– The performance is not as good as it could be, because the LP compiler often ends up working against the C compiler, rather than with it. For example, if GNU C extensions are used to map virtual machine registers into real registers, then those registers can't be used by the C compiler for other purposes. Another example is that because all data manipulation is done via the LP implementation's data structures, rather than local variables, the C compiler's ability to analyze possible aliasing may be significantly inhibited, which can harm the C compiler's ability to optimize the code.

– There is a forced trade-off between efficiency, simplicity, and portability; the optimizations mentioned above, which are needed to achieve good efficiency, compromise portability and/or increase the complexity of the source code.

In this paper, we describe an alternative approach to transforming logic programs to C, using continuation passing to handle nondeterminism, that avoids these drawbacks.
2 Preliminaries
Abstract Syntax. The transformation described in this paper takes as its input logic programs which have been reduced to a simplified intermediate form: unifications are flattened, and each predicate has only one clause (multiple clauses having been converted into explicit disjunctions). The abstract syntax for goals in our simplified core language is as shown in Figure 1.
goal → (goal , goal)                 Conjunction
     ; (goal ; goal)                 Disjunction
     ; not(goal)                     Negation
     ; (goal -> goal ; goal)         If-then-else
     ; once(goal)                    Pruning
     ; pred(var, ...)                Calls
     ; var = var                     Unification (assignment/test)
     ; var = functor(var, ...)       Unification (construction/deconstruction)

Fig. 1. Mercury core abstract syntax
Procedures. A key aspect of our translation is that it is mostly a one-to-one mapping. Each procedure in the logic program is mapped to one C function (possibly containing nested functions, as explained below). Each head variable in a clause is mapped to a corresponding C function parameter, and each non-head variable is mapped to a corresponding C local variable.

Types and Modes. Since Mercury is a (mostly) statically typed and moded language — i.e. the programmer declares and/or the compiler infers the type of each variable, and whether each parameter is input or output — we have full type and mode information available when generating code. So each Mercury type is mapped to the corresponding C type, with input parameters passed by value, and output parameters passed by reference (i.e. using pointer arguments in the generated C code). However, the transformation scheme described here does not require static type or mode information; for a dynamically typed and dynamically moded LP language, it would be possible to map every type in the source language to a single C type, and to always pass arguments by reference, though this would of course add the usual run-time overheads for dynamic typing and dynamic modes.

Procedures with Multiple Modes. An important feature of logic programming is that it supports multi-moded predicates. For example, the same predicate append/3 can be used either to append two lists together (the '(in, in, out)'
mode), or to find all the ways of splitting a single list into two sublists (the '(out, out, in)' mode). For multi-moded Mercury predicates, each mode of the predicate is treated as a different procedure, and so we generate a different C function for each mode of the predicate.

Determinism Analysis. The transformation requires that each (sub-)goal in the abstract syntax be annotated with its determinism, which indicates how many times that goal can succeed (EXIT) each time it is invoked (CALLed). For Mercury, this information is readily available, since the Mercury language includes determinism declarations (optional for procedures local to a module, but mandatory for procedures exported for use by other modules). The compiler's determinism checking and inference [11] produces the information that we need. We use a simplified form of the determinism categories used by the Mercury language, which takes into account only those distinctions which are important for code generation:

– m_det indicates that the goal will succeed exactly once (unless it does not terminate, or throws an exception)
– m_semi indicates that the goal will succeed at most once
– m_non indicates that the goal may succeed any number of times

For example, the '(in, in, out)' mode of append/3 has determinism m_det, while the '(out, out, in)' mode has determinism m_non. Some other LP languages, such as Turbo/PDC/Visual Prolog, also have similar compile-time determinism checking/inference. For other LP languages the determinism information could be obtained by static analysis of the program. This kind of analysis has been done by optimizing Prolog compilers such as Parma [16], Aquarius Prolog [17] and Ciao-Prolog [3].

The transformation scheme relies fairly heavily on having determinism information about every goal (and accurate determinism analysis in turn also requires accurate type and mode information). If determinism information isn't available, it would be possible to use a conservative approximation — in the worst case assigning the determinism m_non to every procedure — but this will significantly reduce the efficiency of the generated code.
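To make this mapping concrete, here is a sketch of the C prototypes that the two modes of append/3 might map to. The Word type and all names are our own illustration, not the compiler's actual output, and the continuation parameter for the m_non mode anticipates the scheme of Section 3:

    typedef void *Word;   /* illustrative universal value type */

    /* '(in, in, out)' mode (m_det): inputs passed by value,
       the output passed by reference */
    void append_in_in_out(Word xs, Word ys, Word *zs);

    /* '(out, out, in)' mode (m_non): outputs passed by reference,
       plus a success continuation called once per solution */
    void append_out_out_in(Word *xs, Word *ys, Word zs,
                           void (*cont)(void));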
3 Our Transformation Scheme
Continuation Passing Style. For nondeterministic procedures, we generate code using an explicit continuation passing style. Each nondeterministic procedure gets translated into a function which takes an extra parameter, a function pointer that points to the success continuation. On success, the function calls its success continuation, and on failure it returns. To keep things simple, our transformation generates code which may contain nested functions (as in Pascal, or GNU C). Our use of nested functions is restricted to what are often known as "downward closures": when we take the
address of a nested function, we only ever do two things with it: pass it as a continuation argument, or call it. The continuations are never returned and never stored inside heap objects or global variables. These conditions are sufficient to ensure that we never keep the address of a nested function after the containing function has returned, so we won't get any dangling continuations.

If the target language doesn't support nested functions (or, like GNU C, doesn't support them efficiently enough) then after the transformation is complete, we have a separate pass that transforms the generated C code into a form that does not use nested functions, by explicitly passing a pointer to an environment struct to each function that was originally nested. Due to space limitations, we do not describe how we do this conversion. Techniques for implementing nested functions are well described in the literature (e.g. [2]).

Calling Convention. In each procedure, we declare a local variable 'bool succeeded'. This is used to hold the success status of m_semi sub-goals. The transformation schemas below show local declarations for the 'succeeded' variable in all the places where they would be needed if we were generating them locally. However, in our current implementation we actually just generate a single 'succeeded' variable for each procedure. This is simpler, but may not be quite as efficient.

The calling convention for sub-goals is as follows (a short C sketch of these conventions appears after the list):

– m_det goal: On success, fall through. (May overwrite 'succeeded'.)
– m_semi goal: On success, set 'succeeded' to TRUE and fall through. On failure, set 'succeeded' to FALSE and fall through.
– m_non goal: On success, call the current success continuation. On failure, fall through. (May overwrite 'succeeded' in either case.)
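The sketch below illustrates the three conventions in generated C; the goals, names, and the hypothetical m_non predicate generate() are all our own illustration:

    #include <stdbool.h>

    typedef void *Word;
    extern void generate(Word *x, void (*cont)(void));

    void example(Word y, void (*cont)(void))
    {
        bool succeeded;
        Word x;

        /* m_det goal: execute and fall through */
        x = y;

        /* m_semi goal: set 'succeeded' and fall through */
        succeeded = (x == y);

        /* m_non goal: generate() calls cont once per solution,
           then falls through when the solutions run out */
        generate(&x, cont);
    }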
Notation. We use the following notation to distinguish between calls in the different code models:

Code model   Notation            Definition
m_det        do Goal             Execute Goal (which must be m_det).
m_semi       succeeded = Goal    Execute Goal, and set 'succeeded' to TRUE
                                 if the goal succeeds and FALSE if it fails.
m_non        Goal && Cont()      Execute Goal, calling the success
                                 continuation function Cont() every time it
                                 succeeds, and falling through when it fails.

We also use the following notation for the transformation rules used by our translator:

situation:
    construct  =⇒  code
This means that in the situation described by situation, the specified construct should be translated by the LP language compiler into the specified code. The code will in general be a mixture of C code and embedded goal fragments which need to be further translated.

3.1 Converting between Different Code Models
If a m_foo goal occurs in a m_bar context, where foo ≠ bar, then we need to modify the code that we emit for the goal so that it conforms to the calling convention expected for m_bar. Normally determinism analysis will ensure that the determinism expected by the context is more permissive than the determinism of the goal. (There is one exception, "commits"; they are dealt with below.) So we only have the following cases to deal with:

m_det Goal in m_semi context:

    succeeded = Goal  =⇒
        do Goal
        succeeded = TRUE;

m_det Goal in m_non context:

    Goal && Cont()  =⇒
        do Goal
        Cont();

m_semi Goal in m_non context:

    Goal && Cont()  =⇒
        bool succeeded;
        succeeded = Goal
        if (succeeded) Cont();

3.2 Code for Conjunctions
Code for empty conjunctions ('true') is trivial, and if the first goal is m_det, it is also straightforward:

m_det goal:

    do true  =⇒  /* fall through */

m_semi goal:

    succeeded = true  =⇒  succeeded = TRUE;

m_non goal:

    true && Cont()  =⇒  Cont();

m_det Goal:

    (Goal , Goals)  =⇒
        do Goal
        Goals
If the first goal is m_semi, then there are two cases: if the conjunction as a whole is m_semi, things are simple, and if the conjunction as a whole is m_non, then we do the same as for the m_semi case, except that we also (ought to) declare a local 'succeeded' variable.
m_semi Goal in m_semi conjunction:

    succeeded = (Goal , Goals)  =⇒
        succeeded = Goal
        if (succeeded) {
            Goals
        }

m_semi Goal in m_non conjunction:

    Goal && Goals  =⇒
        bool succeeded;
        succeeded = Goal
        if (succeeded) {
            Goals
        }
The really interesting case comes when the first goal is m_non. In that case, we need to create a new local continuation function succ_func_n() which we use as the continuation when generating code for the first goal. The continuation function just evaluates the remaining goal(s), with the original continuation function.

m_non Goal:

    (Goal , Goals) && Cont()  =⇒
        succ_func_n() {
            Goals && Cont()
        }
        Goal && succ_func_n()
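For instance (a hypothetical sketch using GNU C nested functions; p, q, and Word are our own stand-ins), the m_non conjunction (p(X), q(X)) executed with continuation cont would come out roughly as:

    typedef void *Word;
    extern void p(Word *x, void (*cont)(void));
    extern void q(Word x, void (*cont)(void));

    /* the body of the C function generated for the calling procedure */
    void caller(void (*cont)(void))
    {
        Word x;
        void succ_func_1(void)      /* GNU C nested function */
        {
            q(x, cont);             /* run q on each solution of p */
        }
        p(&x, succ_func_1);         /* p calls succ_func_1 per solution */
    }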
3.3 Code for Disjunctions
Code for empty disjunctions ('fail') is trivial:

m_semi goal:

    succeeded = fail  =⇒  succeeded = FALSE;

m_non goal:

    fail && Cont()  =⇒  /* fall through */
Code for non-empty disjunctions differs depending on the code model of the disjunction, and on the determinism of the goal that is the first disjunct.

(a) m_det disjunction:

m_det Goal:

    do (Goal ; Goals)  =⇒
        do Goal
        /* Goals is unreachable */

m_semi Goal:

    do (Goal ; Goals)  =⇒
        bool succeeded;
        succeeded = Goal
        if (!succeeded) {
            do Goals
        }
(b) m_semi disjunction:

m_det Goal:

    succeeded = (Goal ; Goals)  =⇒
        do Goal
        succeeded = TRUE
        /* Goals is unreachable */

m_semi Goal:

    succeeded = (Goal ; Goals)  =⇒
        succeeded = Goal
        if (!succeeded) {
            succeeded = Goals
        }
(c) m_non disjunction:

m_det Goal:

    (Goal ; Goals) && Cont()  =⇒
        do Goal
        Cont();
        Goals && Cont()

m_semi Goal:

    (Goal ; Goals) && Cont()  =⇒
        bool succeeded;
        succeeded = Goal
        if (succeeded) Cont();
        Goals && Cont()

m_non Goal:

    (Goal ; Goals) && Cont()  =⇒
        Goal && Cont()
        Goals && Cont()

3.4 Code for If-Then-Else
m_det Cond:

    (Cond -> Then ; Else)  =⇒
        do Cond
        Then

m_semi Cond:

    (Cond -> Then ; Else)  =⇒
        bool succeeded;
        succeeded = Cond
        if (succeeded) {
            Then
        } else {
            Else
        }

m_non Cond:

    (Cond -> Then ; Else)  =⇒
        bool cond_n;
        void then_func() {
            cond_n = TRUE;
            Then
        }
        cond_n = FALSE;
        Cond && then_func()
        if (!cond_n) {
            Else
        }
If-then-elses with m_det and m_semi conditions translate easily into C, as shown in the first two rules above. Mercury also allows if-then-elses with m_non conditions, in which case execution can backtrack from the Then part back into the Cond part. (This is unlike Prolog's standard if-then-else, which always prunes over the Cond, but like e.g. SICStus Prolog's if/3.) Handling these is a little more tricky. We need to ensure that execution won't backtrack into the Else if the Cond succeeds, even if the Cond is later backtracked over. To do this, we introduce a fresh boolean variable (which we call cond_n) to record whether or not Cond has ever succeeded, as shown in the m_non rule above.

We also use the translation rules for if-then-else to handle negations, because we handle not(Goal) as if it were (Goal -> fail ; true). Note that there are some complications with if-then-else and liveness-accurate garbage collection, but due to lack of space we cannot elaborate on these.

3.5 Code for Commits
Most LP languages provide some way to execute a nondeterministic goal, find the first solution, and prune away all the other solutions to that goal. For example, Prolog has '!' ("cut") and once/1, while Mercury has committed choice nondeterminism and automatic pruning of nondeterministic goals with no output variables.

With our continuation-based approach for handling nondeterminism, implementing commits requires some way of unwinding the stack. Depending on the exact target language (which may be e.g. C, GNU C, C++, etc.) there are several different ways in which this can be done:

– using setjmp() / longjmp()
– using GNU C's __builtin_setjmp() / __builtin_longjmp()
– exiting nested functions via GNU C non-local gotos that jump to their containing functions
– using catch/throw
– by testing a flag after each call

The first four alternatives, which are all preferable to the last one, are quite similar. In our implementation, we wanted to support multiple different target languages, so we transform the code to an intermediate representation which abstracts away the differences between the first four approaches using 'TRY_COMMIT' and 'DO_COMMIT' operations.

In the Mercury compiler, places where pruning is required show up after determinism analysis as calls to m_non goals in m_det or m_semi contexts; these are equivalent to calls to once/1 in Prolog. Mercury has no direct equivalent to Prolog's cut, so we don't give a transformation schema for handling cut, but the TRY_COMMIT/DO_COMMIT operations shown below would also be quite suitable for implementing Prolog's cut (including !/1 as in e.g. SWI-Prolog, as well as the standard !/0).

The Abstract Transformation. The transformation rules below are the abstract version, using TRY_COMMIT/DO_COMMIT.
m_non in m_semi context:

    succeeded = once(Goal)  =⇒
        COMMIT_TYPE ref;
        void success() {
            DO_COMMIT(ref);
        }
        TRY_COMMIT(ref, {
            Goal && success()
            succeeded = FALSE;
        }, {
            succeeded = TRUE;
        })

m_non in m_det context:

    do once(Goal)  =⇒
        COMMIT_TYPE ref;
        void success() {
            DO_COMMIT(ref);
        }
        TRY_COMMIT(ref, {
            Goal && success()
        }, {})
setjmp/longjmp. When using setjmp()/longjmp(), the abstract operations mentioned above are defined as follows: 'COMMIT_TYPE' is 'jmp_buf', 'DO_COMMIT(ref)' is 'longjmp(ref, 1)', and 'TRY_COMMIT(ref, s1, s2)' is 'if (setjmp(ref)) s2 else s1'.

Care is required when using setjmp()/longjmp(), because the ANSI/ISO C standard says that longjmp() is allowed to destroy the values of any non-volatile local variables in the function that called setjmp() which have been modified between the setjmp() and the longjmp(). To avoid this, whenever we generate a commit, we put it in its own nested function, with the local variables (e.g. succeeded, plus any outputs from the goal that we are committing over) remaining in the containing function. This ensures that none of the variables which get modified between the setjmp() and the longjmp() and which get referenced after the longjmp() are local variables in the function containing the setjmp(). Due to lack of space, we omit discussion of the other alternatives.
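Spelled out as C macros, the setjmp()-based definitions above amount to the following sketch (the real generated code additionally applies the nested-function precaution just described):

    #include <setjmp.h>

    #define COMMIT_TYPE              jmp_buf
    #define DO_COMMIT(ref)           longjmp(ref, 1)
    #define TRY_COMMIT(ref, s1, s2)  if (setjmp(ref)) { s2 } else { s1 }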
3.6 Calls
Generating code for individual calls is straightforward. Predicate calls are mapped directly to C function calls:

m_det call:

    do p(A1, A2, ...)  =⇒
        p(A1, A2, ...);

m_semi call:

    succeeded = p(A1, A2, ...)  =⇒
        succeeded = p(A1, A2, ...);

m_non call:

    p(A1, A2, ...) && Cont()  =⇒
        p(A1, A2, ..., Cont);
The only significant complication is that output arguments should be passed by reference. The details for handling this are straightforward but tedious, so we omit a detailed description, and instead refer interested readers to the Mercury compiler sources.
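For example (a sketch with hypothetical names), a m_det call p(X, Y) in which X is an input and Y an output would be emitted as a C call whose output is passed by reference:

    typedef void *Word;
    extern void p(Word x, Word *y);    /* hypothetical m_det predicate */

    void call_site(Word x)
    {
        Word y;
        p(x, &y);    /* input by value, output by reference */
    }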
3.7 Unifications
The code generated for unifications is straightforward, but will depend on the exact data representation chosen. For m_semi unifications, we need to set succeeded to indicate whether the unification succeeded or not. The other details are much the same as in traditional approaches to compiling logic programs to C, so we only give a sample:
m_det deconstruct:

    do (X = f(A1, A2, ...))  =⇒
        /* extract arguments */
        A1 = arg(X, f, 1);
        A2 = arg(X, f, 2);
        ...

m_semi deconstruct:

    succeeded = (X = f(A1, A2, ...))  =⇒
        /* tag test */
        succeeded = (X = f(_, _, ...))
        if (succeeded) {
            /* extract arguments */
            A1 = arg(X, f, 1);
            A2 = arg(X, f, 2);
            ...
        }
Here arg() could be defined as a C macro or function in the runtime library.
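For instance, under a simple representation in which a term points to a block of argument words — an assumption of ours; Mercury's real representation also involves tag bits — arg() might be:

    typedef void *Word;

    /* hypothetical: X points to a block whose word N-1 holds the
       N-th argument of functor F (F is unused in this simple scheme) */
    #define arg(X, F, N)  (((Word *) (X))[(N) - 1])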
4 Implementation and Benchmarks
We have implemented this approach in a new back-end for the Mercury compiler, which is included, together with the original back-end, in Mercury 0.10 and 0.10.1 (released April 2001). The choice of back-end is controlled by a compiler option. We have tested the correctness of our implementation by successfully bootstrapping the compiler using the new back-end, and by passing, on several architectures, all the appropriate tests (several hundred) in the Mercury test suite.

We have evaluated our new compilation scheme, which we refer to as high level C (hlc.gc), by comparing it to our old scheme [14], which was the fastest existing Mercury implementation. The old scheme compiles Mercury to C code that is so low level that it uses C only as a portable assembler [10]; the fastest version of this scheme, asm_fast.gc, uses small pieces of assembly code as well as GNU C extensions. The two schemes use identical data representations and the same garbage collector [1]. The results are shown in Figure 2. The SLOC column shows the number of Source Lines of Code for each benchmark, excluding comments and blank lines. The following two groups of columns compare asm_fast.gc and hlc.gc with respect to CPU times and executable sizes.
                          time (in seconds)             size (in kb)
Program     SLOC   asm_fast.gc  hlc.gc  ratio   asm_fast.gc  hlc.gc  ratio
mmc       171474         17.20   20.36   1.18          6320    4504   0.71
compress     385         26.73   18.98   0.71          1328     944   0.71
icfp2000    4341         65.36   30.74   0.47          1848    1148   0.62
icfp2001     458         33.02   32.17   0.97          1344     952   0.71
nuc         3120         40.53   31.39   0.77          1392    1040   0.75

Fig. 2. Benchmark speed ratios
The mmc test case is the Mercury compiler translating a large source file. Compress is a Mercury version of the 129.compress benchmark from the SPECint95 suite. The next two entries involve our group's entries in recent ICFP programming contests. The 2000 entry is a ray tracer that generates .ppm files from a structural description of a scene, while the 2001 entry is a source-to-source compression program for a hypothetical markup language. Nuc is a Mercury version of the pseudoknot benchmark, executed 1000 times. The benchmark machine was a Gateway 5150XL laptop (700 MHz PIII, 256 Mb, Linux 2.2.18). Further details of the test setup are available from our web site.

On the two floating-point intensive programs (icfp2000 and nuc), the high level C back end already outperforms the old back end, principally because it boxes floating point values only when they are stored on the heap, not when they are stored on the stack. Storing unboxed floating point values on the stack comes naturally when the C compiler is managing the stack frames, but doing the same in our low-level back-end would be difficult, because it would significantly complicate our stack slot allocation algorithm.

The high level C back end also outperforms the old back end on compress, mainly because compress's work is dominated by complex integer expressions, and the C compiler can store the intermediate results in machine registers whereas the Mercury compiler must usually put them in virtual registers that are actually stored in memory. In both cases, the new back end wins because it does better at reusing the development effort already invested in existing C compilers.

For the remaining two programs, the picture is mixed. On mmc, the high level C back end is slower than the old back end; on icfp2001, it is about the same speed. One advantage of the old back end is that it has a more streamlined calling convention. However, the old back end relies on exploiting GNU extensions to C for its efficiency. Projects that need to use a C compiler other than gcc (e.g. Microsoft Visual C) cannot use these extensions. Without those extensions, the low level back end loses much of its speed; e.g. the time for mmc increases from 17.20s to 27.42s. Since the high level C back end does not need to use gcc extensions for its speed, it consistently outperforms the old back end on such projects. This last point makes our scheme especially useful for commercial users, who often need to link Mercury programs with software such as Microsoft Foundation Classes, and therefore need to use Microsoft compilers.
5 Related Work
The idea of implementing nondeterminism by invoking a continuation on success and falling through on failure is not new. It has been proposed several times in the literature, in several contexts — for example Prolog meta-interpreters and implementing nondeterminism in languages such as Lisp [5] — and it is closely related to the idea of binarization of Prolog programs [15] and to the implementation of generators in languages such as Icon [13]. However, few of these papers have formal translation rules, and few consider the optimization opportunities presented by knowledge of determinism information, which can be derived by program analysis even for languages such as Prolog in which determinism is not a fundamental concept. Few have practical, well-tested implementations. None have both formal translation rules and a practical implementation.

The only papers that we know of that use a translation scheme that is reasonably closely related to the one presented in this paper are [12] and [18], which describe schemes for translating strongly typed variants of Prolog to Pascal and C respectively. Both transformations have significant limitations. They do not handle if-then-else or the Prolog cut operator, nor do they handle nested disjunctions. While they both have examples showing how one can exploit determinism information to generate better code, they do not describe, even informally, the rules that govern the generation of that better code.
6 Conclusions
We have presented a scheme for translating Mercury to high level C code. The new compilation scheme has already shown itself to be competitive with our previous scheme for compilation to low level C, beating its performance for programs that are floating point intensive and in environments where one cannot use gcc as the C compiler and therefore cannot use GNU C extensions to the C language.

Furthermore, this performance is achieved with a model that is in our opinion significantly simpler than earlier approaches such as the original Mercury compiler or WAM-based Prolog to C compilers. We have no need for additional global data structures, such as virtual machine registers, environment or choice point stacks, and the like, and we avoid the need to do our own register or stack slot allocation. Since the C code that we generate is closer to what an ordinary C programmer would write, the C compiler can be expected to optimize it better, and it is less likely to trigger obscure bugs in the C compiler.

Our translation scheme can also be adapted to target languages other than C. We have used it as the basis of the Mercury code generator that targets IL, the intermediate language of the .NET Common Language Runtime [8], and as the basis of an (as yet incomplete) code generator that emits Java. Most of the difficult issues in those ports concern issues such as data representation that are orthogonal to the topic of this paper; the adaptation of the translation scheme has been relatively straightforward.
The main drawback of our translation scheme is that we cannot guarantee tail call optimization for indirectly recursive tail calls, unless there is explicit support for this in the target language (as is the case for IL and C--).
Acknowledgements We would like to thank David Overton, Ralph Becket, Bernard Pope, Kevin Glynn, and the anonymous referees for reviewing earlier drafts of this paper, and Microsoft for their financial support.
References

1. H. Boehm and M. Weiser. Garbage collection in an uncooperative environment. Software Practice and Experience, 18:807–820, 1988.
2. T. M. Breuel. Lexical closures for C++. In Proceedings of the 1988 USENIX C++ Conference, pages 293–304, Denver, Colorado, 1988.
3. F. Bueno, D. Cabeza, M. Carro, M. Hermenegildo, P. López-García, and G. Puebla. The Ciao Prolog system. Reference manual. Technical Report CLIP3/97.1, School of Computer Science, Technical University of Madrid (UPM), August 1997. Available from http://www.clip.dia.fi.upm.es/.
4. L. Byrd. Understanding the control of Prolog programs. Technical Report 151, University of Edinburgh, 1980.
5. M. Carlsson. On implementing Prolog in functional programming. New Generation Computing, 2(4):347–359, 1984.
6. P. Codognet and D. Diaz. wamcc: Compiling Prolog to C. In Proceedings of the Twelfth International Conference on Logic Programming, pages 317–331, Kanagawa, Japan, June 1995.
7. T. Conway, F. Henderson, and Z. Somogyi. Code generation for Mercury. In Proceedings of the Twelfth International Conference on Logic Programming, pages 242–256, Portland, Oregon, December 1995.
8. T. Dowd, F. Henderson, and P. Ross. Compiling Mercury to the .NET Common Language Runtime. In Proceedings of the First International Workshop on Multi-Language Infrastructure and Interoperability, pages 70–85, Firenze, Italy, September 2001.
9. B. Hausman. Turbo Erlang: approaching the speed of C. In E. Tick, editor, Implementations of Logic Programming Systems, pages 119–135. Kluwer, 1994.
10. F. Henderson, Z. Somogyi, and T. Conway. Compiling logic programs to C using GNU C as a portable assembler. In Proceedings of the ILPS '95 Postconference Workshop on Sequential Implementation Technologies for Logic Programming Languages, Portland, Oregon, December 1995.
11. F. Henderson, Z. Somogyi, and T. Conway. Determinism analysis in the Mercury compiler. In Proceedings of the Australian Computer Science Conference, pages 337–346, Melbourne, Australia, January 1996.
12. J. F. Nilsson. On the compilation of a domain-based Prolog. In Proceedings of the Ninth IFIP Congress, pages 293–298, Paris, France, 1983.
13. J. O'Bagy and R. E. Griswold. A recursive interpreter for the Icon programming language. In Proceedings of the 1987 SIGPLAN Symposium on Interpreters and Interpretive Techniques, pages 138–149, St. Paul, Minnesota, 1987.
14. Z. Somogyi, F. Henderson, and T. Conway. The execution algorithm of Mercury, an efficient purely declarative logic programming language. Journal of Logic Programming, 29(1–3):17–64, October–December 1996.
15. P. Tarau, K. D. Bosschere, and B. Demoen. Partial translation: towards a portable and efficient Prolog implementation technology. Journal of Logic Programming, 29(1–3):65–83, October–December 1996.
16. A. Taylor. LIPS on a MIPS: results from a Prolog compiler for a RISC. In Proceedings of the Seventh International Conference on Logic Programming, pages 174–185, Jerusalem, Israel, June 1990.
17. P. Van Roy and A. Despain. High-performance logic programming with the Aquarius Prolog compiler. IEEE Computer, 25(1):54–68, January 1992.
18. J. Weiner and S. Ramakrishnan. A piggy-back compiler for Prolog. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 288–296, Atlanta, Georgia, June 1988.
CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs

George C. Necula, Scott McPeak, Shree P. Rahul, and Westley Weimer

Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
{necula,smcpeak,sprahul,weimer}@cs.berkeley.edu
Abstract. This paper describes the C Intermediate Language: a high-level representation along with a set of tools that permit easy analysis and source-to-source transformation of C programs. Compared to C, CIL has fewer constructs. It breaks down certain complicated constructs of C into simpler ones, and thus it works at a lower level than abstract-syntax trees. But CIL is also higher-level than typical intermediate languages (e.g., three-address code) designed for compilation. As a result, what we have is a representation that makes it easy to analyze and manipulate C programs, and emit them in a form that resembles the original source. Moreover, it comes with a front-end that translates to CIL not only ANSI C programs but also those using Microsoft C or GNU C extensions. We describe the structure of CIL with a focus on how it disambiguates those features of C that we found to be most confusing for program analysis and transformation. We also describe a whole-program merger based on structural type equality, allowing a complete project to be viewed as a single compilation unit. As a representative application of CIL, we show a transformation aimed at making code immune to stack-smashing attacks. We are currently using CIL as part of a system that analyzes and instruments C programs with run-time checks to ensure type safety. CIL has served us very well in this project, and we believe it can usefully be applied in other situations as well.
1 Introduction
The C programming language is well-known for its flexibility in dealing with low-level constructs. Unfortunately, it is also well-known for being difficult to understand and analyze, both by humans and by automated tools. When we embarked on our project to analyze and instrument C programs in order to bring out the existing safe usage of pointers or to enforce it when it was not apparent, we examined a number of existing C intermediate languages and front ends before deciding to create our own. None of the available toolkits met all of our requirements. Some (e.g., [3,9]) were too high-level to support detailed analyses; some were designed to be fed to a compiler and were thus too low level; and some (e.g., SUIF [14,8]) failed to handle GCC extensions, which prevented them from working on software that uses these extensions, such as Linux device drivers and kernels.

* This research was supported in part by the National Science Foundation Career Grant No. CCR-9875171, and ITR Grants No. CCR-0085949 and No. CCR-0081588, and gifts from AT&T Research and Microsoft Research. The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.
1  struct { int *fld; } *str1;
2  struct { int fld[5]; } str2[4];
3  str1[1].fld[2];
4  str2[1].fld[2];

Fig. 1. A short C program fragment highlighting ambiguous syntax
Extracting the precise meaning of a C program often requires additional processing of the abstract syntax. For example, consider lines 3 and 4 in Fig. 1. They have the same syntax but different meanings: line 3 involves three memory references while line 4 involves only one. While low-level representations do not have such ambiguities, they typically lose structural information about types, loops and other high-level constructs. In addition, it is difficult to print out such a low-level representation in a way that is faithful to the original source. Our goal has been to find a compromise between the two approaches.

The applications we are targeting are systems that want to carry out analyses and source-to-source transformations on C programs. A good intermediate language for such a task should be simple to analyze, close to the source and able to handle real-world code. This paper describes CIL, a highly-structured "clean" subset of C that meets these requirements.

CIL features a reduced number of syntactic and conceptual forms; for example, all looping constructs are reduced to a single form, all function bodies are given explicit return statements and syntactic sugar like "->" is eliminated. CIL also separates type declarations from code, makes type promotions explicit and flattens scopes (with alpha renaming) within function bodies. These simplifications reduce the number of cases that must be considered when manipulating a C program, making it more amenable to analysis and transformation. Many of these steps are carried out at some stage by most C compilers, but CIL makes analysis easier by exposing more structure in the abstract syntax.

CIL's conceptual design tries to stay close to C, so that conclusions about a CIL program can be mapped back to statements about the source program. Additionally, translating from CIL to C is fairly easy, including reconstruction of common C syntactic idioms. Finally, a key requirement for CIL is the ability to parse and represent the variety of constructs which occur in real-world systems code, such as
compiler-specific extensions and inline assembly. CIL supports all GCC and MSVC extensions except for nested functions, and it can handle the entire Linux kernel.

The rest of this paper describes our handling of C features and CIL applications. In Section 2 we describe the syntax, typing and semantics for our language of lvalues. We present expressions and instructions in Section 3 and control-flow information in Section 4. Section 5 details our treatment of types. We discuss source-level attributes in Section 6. Having described the features of CIL we move on to using it for analysis in Section 7 and applying it to existing multi-file programs in Section 8. In Section 9 we discuss related work and we conclude in Section 10.
2 Handling of Lvalues
An lvalue is an expression referring to a region of storage [7]. Only an lvalue can appear on the left-hand side of an assignment. Understanding lvalues in C requires more than a simple abstract syntax tree. As shown in Fig. 1, the C fragment str1[1].fld[2] may involve one, two or three memory references depending on the types involved. If str1 and fld are both arrays, the fragment actually refers to an offset within a single contiguous object named str1. If str1 is an array and fld is a pointer, the value at str1[1].fld must be loaded and then an offset from that value must be referenced. The case when str1 is a pointer and fld is an array is similar. Finally, if both are pointers, str1, str1[1].fld and str1[1].fld[2] must all be referenced. As a result, program analyses that care about these differences will find it hard to analyze lvalues in abstract-syntax tree form.
lvalue  ::= ⟨lbase, loffset⟩
lbase   ::= Var(variable) | Mem(exp)
loffset ::= NoOffset | Field(field, loffset) | Index(exp, loffset)

Fig. 2. The abstract syntax of CIL lvalues
As shown in Fig. 2, in CIL an lvalue is expressed as a pair of a base plus an offset. The base address can be either the starting address for the storage for a variable (local or global) or any pointer expression. We distinguish the two cases so that we can tell quickly whether we are accessing a component of a variable or a memory region through a pointer. An offset in the variable or memory region denoted by the base consists of a sequence of field or index designators.

The meaning of an lvalue is a memory address along with the type of the object stored there. Fig. 3 shows the definitions of two judgments that define the meaning. The meaning of a variable base is the address of the variable and its type. The judgment Γ ⊢ lbase ⇓ (a, τ) says that the lvalue base lbase refers to an object of type τ at address a.
    Γ(x) = τ
    ─────────────────────
    Γ ⊢ Var(x) ⇓ (&x, τ)

    Γ ⊢ e : Ptr(τ)
    ─────────────────────
    Γ ⊢ Mem(e) ⇓ (e, τ)

    ──────────────────────────────
    Γ ⊢ (a, τ) @ NoOffset ⇓ (a, τ)

    τ1 = Struct(f : τf, ...)    Γ ⊢ (a1 + OffsetOf(f, τ1), τf) @ off ⇓ (a2, τ2)
    ───────────────────────────────────────────────────────────────────────────
    Γ ⊢ (a1, τ1) @ Field(f, off) ⇓ (a2, τ2)

    τ1 = Array(τ)    Γ ⊢ (a1 + e * SizeOf(τ), τ) @ off ⇓ (a2, τ2)
    ──────────────────────────────────────────────────────────────
    Γ ⊢ (a1, τ1) @ Index(e, off) ⇓ (a2, τ2)

Fig. 3. Typing and evaluation rules for CIL lvalues
Lvalue offsets are treated as functions that shift address-type pairs to new address-type pairs within the same object. The judgment Γ ⊢ (a1, τ1) @ o ⇓ (a2, τ2) means that the lvalue offset o, when applied to an lvalue denoting (a1, τ1), yields an lvalue denoting an object of type τ2 at address a2. In this latter judgment a2 is an address within the range [a1, a1 + sizeof(τ1)).

Considering again the example from Fig. 1, the two lvalues shown there have the following CIL representations, in which it is obvious when we reference a variable or a pointer indirection:

    str1[1].fld[2] = ⟨Mem(2 + Lvalue(⟨Mem(1 + Lvalue(⟨Var(str1), NoOffset⟩)),
                                      Field(fld, NoOffset)⟩)), NoOffset⟩
    str2[1].fld[2] = ⟨Var(str2), Index(1, Index(2, NoOffset))⟩

This interpretation of lvalues upholds standard C equivalences like "x == *&x" and "(*(&a.f)).g == a.f.g", and makes tasks like instrumenting every memory access in the program much easier.

As in other intermediate representations, all occurrences of the same variable share a variable declaration. This makes it easy to change variable properties (like the variable name or type) and allows for the use of pointer equality checks when comparing variables.
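These equivalences are the familiar C ones and are directly checkable; the little test program below is our own illustration:

    #include <assert.h>

    struct T { int g; };
    struct S { struct T f; };

    int main(void)
    {
        int x = 1;
        struct S a = { { 2 } };
        assert(x == *&x);                /* x == *&x */
        assert((*(&a.f)).g == a.f.g);    /* (*(&a.f)).g == a.f.g */
        return 0;
    }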
3 Expressions and Instructions
CIL syntax has three basic concepts: expressions, instructions, and statements. Expressions represent functional computation, without side-effects or control flow. Instructions express side effects, including function calls, but have no local (intraprocedural) control flow. Statements capture local control flow.

The abstract syntax for CIL expressions is given in Fig. 4. Constants are fully typed and their original textual representation is maintained in addition to their value. SizeOf and AlignOf expressions are preserved both because computing them is dependent on compiler and compilation options, and also because a transformation may wish to change types. Casts are inserted explicitly to make the program conform to our type system, which has no implicit coercion rules. The StartOf expression has no explicit C syntax but is used to represent the implicit coercion from an array to the address of its first element. Without such
a rule, a typing judgment for *exp must do a case analysis based on the type of exp, leading to two distinct typing rules for *exp. The addition of StartOf allows for syntax-directed type checking, by making the coercion explicit in the source. The StartOf operator is not printed, and has the following type rule (it is the only way to convert an array to a pointer to the first element):

    ⊢ lvalue ⇓ (a, Array(τ))
    ─────────────────────────────────
    ⊢ StartOf(lvalue) ⇓ (a, Ptr(τ))

The other C expressions (such as the "? :" operator or expressions that can have side-effects) are converted to CIL instructions or statements, which are discussed next.
exp ::= Constant(const)      | Lvalue(lvalue)
      | SizeOfType(type)     | SizeOfExp(exp)
      | AlignOfExp(exp)      | AlignOfType(type)
      | UnOp(unop, exp)      | BinOp(binop, exp, exp)
      | Cast(type, exp)      | AddressOf(lvalue)
      | StartOf(lvalue)

instr ::= Set(lvalue, exp)
        | Call(lvalue option, exp, exp list)
        | Asm(raw strings, lvalue list, exp list)

Fig. 4. The syntax of CIL expressions and instructions
Each instruction contains a single assignment or function call. The Set instruction updates the value of an lvalue. The Call instruction has an optional lvalue into which the return value of the function is stored. The function component of the Call instruction must be of function type; CIL removes redundant & and * operators applied to functions or function pointers. The arguments to functions are expressions (without side-effects or embedded control flow). Finally, the Asm instruction is used to capture the common occurrence of inline assembly in systems programs. CIL understands Microsoft- and GNU-style assembly directives and reports the inputs (as a list of expressions) and the outputs (as a list of lvalues) of the assembly block. Other information (volatility, raw assembly template strings) is stored, but not interpreted.

CIL also stores location information with all statements and can take advantage of this information to insert #line directives when emitting output. This allows error messages in a heavily-transformed program to line up with the correct source line in the original program.
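As an illustration of the expression/instruction split (the shape only; CIL's actual temporary naming differs), a side-effecting C assignment decomposes into side-effect-free expressions and single-effect instructions:

    extern int f(int);

    void example(int a, int b)
    {
        int y, tmp;

        /* original C:  y = f(a) + b++;  */

        /* CIL-style decomposition: */
        tmp = f(a);      /* Call instruction */
        y = tmp + b;     /* Set instruction  */
        b = b + 1;       /* Set instruction  */
    }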
4 Integrating a CFG into the Intermediate Language

On top of the lvalues, expressions and instructions, CIL provides both high-level program structure and low-level control-flow information. The program structure is captured by a recursive structure of statements, with every statement
annotated with successor and predecessor control-flow information. This single program representation can be used with routines that require an AST (e.g., type-based analyses or pretty-printers), as well as with routines that require a CFG (e.g., dataflow analyses).
stmt ::= Instr(instr list)
       | Return(exp option)
       | Goto(stmt)
       | Break
       | Continue
       | If(exp, stmt list, stmt list)
       | Switch(exp, stmt list, stmt list)
       | Loop(stmt list)

Fig. 5. The syntax of CIL statements

Fig. 5 shows the syntax of CIL statements. In addition to the information we show, each statement also contains labels, source location information and a list of successor and predecessor statements. Assignments and function calls are grouped under Instr and do not have any control flow embedded within them. CIL can resolve Break and Continue to Gotos if desired, but leaving them as they are makes code-motion transformations (e.g., loop unrolling) easier. A Return statement optionally records the return value. Every function in CIL has at least one Return statement. An If statement records the condition, which is an expression, together with the two branches, which are lists of statements. CIL has only a loop-forever looping construct and we always use a Break statement to exit from such a loop (see the sketch below). In many cases the pretty printer is able to print out a nicer-looking while loop.

Notice that Fig. 5 does not have any syntax for case, which is used in switch statements. The reason is that we implement case as an optional label that can be associated with any statement. A switch statement then consists of an expression and a list of statements which represent the entire body of the switch (with the case labels indicating the starting points of the various cases). To provide faster access to the individual cases, we also store the starting points of the cases as a separate list in the switch statement.
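Returning to the loop-forever construct: a C while loop takes roughly this shape (a sketch of the form, not CIL's exact output):

    extern int c(void);
    extern void body(void);

    void original(void)          /* the source program */
    {
        while (c()) body();
    }

    void cil_style(void)         /* Loop with an If/Break at the top */
    {
        for (;;) {
            if (!c()) break;
            body();
        }
    }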
5 Handling of Types
Fig. 6 describes the representation of C types in CIL. The Named type arises from uses of type names defined with typedef. The other types have their usual counterparts in C. The notable features of CIL with respect to type handling have to do with composite types, i.e. structs and unions. C programs can declare named and anonymous composite types at the file scope or in local scopes. This makes it hard to move expressions that involve locally defined types and also forces one to scan the entire AST to find declarations of such types. To simplify these tasks CIL moves all type declarations to the beginning of the program and gives them global scope. All anonymous composite types are given unique names in CIL and every composite type has its own declaration at the top level.
type ::= Void | Int(intKind) | Float(floatKind)
       | Ptr(type) | Array(type, exp)
       | Fun(type, variable list) | Enum(enumInfo)
       | Named(string, type)
       | Struct(compInfo) | Union(compInfo)

enumInfo ::= (string, item list)
compInfo ::= (string, field list)

Fig. 6. The abstract syntax of CIL types

All references to a composite type share the same instance of the compInfo structure, which makes it easy to change the definition of a composite type and also provides a common place to watch for recursive type definitions (all such definitions must involve at least one compInfo). As far as types are concerned, CIL is similar to SUIF except that SUIF eliminates all user-defined typedefs and introduces extraneous ones, while CIL is careful to maintain the typedef structure present in the source.
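For instance (the generated name below is invented for illustration), a locally declared anonymous structure is given a name and its declaration is hoisted to the top level:

    /* original C */
    void f(void)
    {
        struct { int x; } s;
        s.x = 0;
    }

    /* CIL-style result (generated name invented) */
    struct __anonstruct_s_1 { int x; };

    void f_cil(void)
    {
        struct __anonstruct_s_1 s;
        s.x = 0;
    }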
6 Handling of Attributes
It is often useful to have a mechanism for the programmer to communicate additional information to the program analysis. We decided to use and extend for this purpose the GNU C notation for pragmas and attributes. Pragmas can appear only at top level while attributes can be associated with identifiers and with their types. The advantage of this method is that gcc will still be able to process the annotated file (since it ignores attributes and pragmas that it does not recognize).

In GNU C a declaration can contain a number of attributes of the form __attribute__((a)) where a is the attribute. For example, here is the prototype for the printk function found in the Linux kernel:

    int printk(const char *fmt, ...)
        __attribute__ ((format (printf, 1, 2)));

The attribute above is associated with the name being declared, and it indicates that printk is a printf-like function, whose first argument is a format string, and whose arguments starting from the second are to be matched with the format specifiers.

One difficulty in using the GNU C notation for attributes is the apparent lack of a formal specification for attribute placement and attribute association with types and identifiers. We have worked out a specification that seems to extend both that of GNU C and the placement of type qualifiers in ANSI C [6]. Attributes and pragmas can use the sub-language of C expressions excluding the comma expression and side-effecting expressions but including a constructed attribute such as the format attribute in the example above.
The following is the syntax of C declarations that our front-end supports:

declaration ::= base-type attributes_opt declarator attributes_opt init_opt
declarator  ::= identifier
              | declarator [ exp_opt ]
              | * attributes_opt declarator
              | declarator ( parameters_opt )
              | ( attributes_opt declarator )
The attributes that appear at the end of the declaration are associated with the declared identifier. All other attributes are associated with types. In particular, attributes appearing after a base type are associated with that type, and those appearing after the pointer type constructor * are associated with the pointer type. Finally, the attributes appearing before the declarator in a parenthesized declarator are associated with the type of the declarator. For example, in the declaration below we declare an array a of 8 pointers to functions with no arguments and returning pointers to integers:

    int A1 * A2 (A3 * (A4 a)[8])(void) A5;

The attribute A1 belongs to the type int and A2 to the pointer type int A1 *. The attribute A3 belongs to the function type and A4 to the array type (the type of a). The attribute A5 applies to the declared name a.

The gcc compiler accepts most of this attribute language but does not accept all of it in all contexts in which declarations occur. For example, name attributes are accepted in function prototypes but not in function definitions. This suggests that the placement of attributes has not been carefully designed in gcc but rather added in an ad-hoc manner.
7 Using CIL for Analyses and Source-to-Source Transformations
This section describes two concrete uses of CIL. The first is a small example that demonstrates the ease with which CIL can be used to encode simple program transformations. The second shows how CIL can be used to support serious program analysis and transformation tasks.

7.1 Preventing Buffer Overruns
To demonstrate the use of CIL for source-to-source transformations we present the CIL encoding of a refinement of the StackGuard [4] buffer overrun defense. StackGuard is a gcc patch that places a special "canary" word next to the return address on the stack and checks the validity of the canary word before returning from a function. It is likely that any buffer overrun that rewrites the return address will also modify the canary and thus be detected.
 1  exception NeedsGuarding        (* should we guard this function? *)
 2
 3  class containsArray = object   (* does this type contain an array? *)
 4    inherit nopCilVisitor        (* only visit types *)
 5    method vtype t = match t with           (* inspect the type *)
 6      TArray _ -> raise NeedsGuarding       (* found an array, guard it *)
 7    | TPtr _ -> SkipChildren                (* do not follow pointers *)
 8    | _ (* not array *) -> DoChildren       (* no array yet, keep looking *)
 9  end
10
11  class sgFixupReturn restore_ra_stmt = object  (* rewrite all returns *)
12    inherit nopCilVisitor                       (* only look for returns *)
13    method vstmt s = match s.skind with         (* check each statement *)
14      Return _ -> let new_block = mkBlock       (* restore the ra *)
15        [restore_ra_stmt ; s] in ChangeTo(mkStmt new_block)
16    | _ (* not Return *) -> DoChildren   (* descend in other statements *)
17  end
18
19  class sgAnalyzeVisitor f get_and_push_ra restore_ra = object
20    inherit nopCilVisitor              (* consider each function *)
21    method vfunc fundec =              (* do we need to guard this one? *)
22      try      (* raise an exception if we need to guard it *)
23        List.iter (fun vi ->           (* inspect each local variable *)
24          visitCilType (new containsArray) vi.vtype ;  (* find arrays *)
25        ) fundec.slocals ;
26        SkipChildren                   (* no local arrays found, return *)
27      with NeedsGuarding ->            (* local arrays present, guard this *)
28        fundec.sbody.bstmts <- get_and_push_ra :: fundec.sbody.bstmts ;
29        let modify = new sgFixupReturn restore_ra in
30        fundec.sbody <- visitCilBlock modify fundec.sbody ;
31        ChangeTo(fundec)               (* now this function saves the *)
32  end         (* return address on entry and restores it on exit *)
33
34  let stackguard (f : file) =          (* apply the transformation *)
35    let make_stmt fundec = mkStmt (Instr
36      [Call(None, Lval(Var(fundec.svar),NoOffset), [], locUnknown)]) in
37    (* get_and_push_ra and restore_ra are external functions *)
38    (* build up CIL statements that call those functions *)
39    let get_and_push_ra = make_stmt (emptyFunction "get_and_push_ra") in
40    let restore_ra = make_stmt (emptyFunction "restore_ra") in
41    visitCilFile (new sgAnalyzeVisitor get_and_push_ra restore_ra) f

Fig. 7. Complete OCaml source for a refined StackGuard transformation using CIL
Fig. 7 shows a refined implementation of StackGuard. This transformation pushes the current return address on a private stack when a function is entered (line 28) and pops the saved value before returning (line 15). We assume that there are two external functions, get_and_push_ra and restore_ra, for this purpose. Only functions with local variables that contain arrays are modified (the code in lines 3-9 implements the check). This transformation is simplified by the fact that all CIL functions have explicit returns (checked for on line 14). The code makes use of CIL library routines (like visitors). After applying this transformation, all that remains is to provide (at link time) the implementation of the functions that save and restore the return address.

This transformation would be significantly more complicated when performed on an AST. In fact, the transformation would first have to perform some of the elaboration that CIL performs.

7.2 Ensuring Memory Safety of C Programs
CCured [12] is a system that combines type inference and run-time checking to make existing C programs memory-safe. It carries out a whole-program analysis of the structure and use of the types in the program. It uses the results of the analysis to change the type definitions and memory accesses in the program. When the safety of a memory reference cannot be statically verified, an appropriate run-time check is inserted. The analysis involves iterating over all the types in the program and comparing those that are involved in casts using a form of structural equality. CIL’s simpler type language, in which recursion is limited to composite types, makes this easier. As a result of this analysis, some pointers are transformed into multiword structures that carry extra run-time information (for example array-bounds information). Memory reads and writes involving such pointers are instrumented to contain run-time checks. These transformations are quite extensive and require detailed modifications of types, lvalues, variables and declarations. Without the clear disambiguation of these features provided by CIL, it would be difficult to determine which syntactic constructs represent accesses to memory and how to change them. The transformed program is available for user inspection and compiler consumption. CIL’s high-level structural information means that the resulting output is quite faithful to the original source, allowing the two to be compared more easily than is possible with conventional intermediate representations. CCured makes use of the whole-program merger, described in Section 8, to handle entire software projects. It also uses attributes, described in Section 6, to communicate detailed information about pointer structure.
8 A Whole-Program Merger
We have described so far an intermediate language that makes both program analysis and source-to-source transformation easy. However, many analyses are most effective when applied to the whole program. Therefore we designed and implemented a tool that merges all of a program's compilation units into a single compilation unit, with proper renaming to preserve semantics.

We designed the merger application to impersonate a compiler (it works with both the GNU C and Microsoft Visual C compilers) and to keep track of all the files that are compiled to build the whole program, along with the specific compiler options that were used for each file. When the compiler is invoked for compilation only (no linking), our tool creates the expected object file but stores in it only the preprocessed source file. The actual compilation is delayed until link time. When the compiler is invoked to link the program, it learns the names of all the object files that constitute the project. All of the associated preprocessed source files can then be loaded and merged. This setup has the benefit that it can be used with make-based projects by simply changing the name of the invoked compiler and linker.

The actual merging of compilation units turned out to be surprisingly tricky. First, file-scope identifiers must be renamed properly to avoid clashes with globals and with similar identifiers in different files. In C these are the identifiers of variables and functions declared static, the names of types introduced with typedef, and the tags of union, structure and enumeration types. Unfortunately this is not sufficient, because file-scope type identifiers declared in header files will result in multiple copies with different names at each inclusion point. Since C uses name equivalence for types, such copies will no longer be compatible, leading to numerous type errors in the merged program. As a result we need to do a more careful renaming of file-scope identifiers.

To illustrate the problem, consider the two file fragments below. For clarity, we add a "2" suffix to the file-scope names from the second file; in reality the names might be identical in the two files, especially if they originate from the same header file.

File 1:

    struct list  { int data; struct list *next; };
    extern struct list *head;
    struct tree  { struct stuff *data; struct tree *l, *r; };
    struct stuff { int elem; };
    ...

File 2:

    struct list2  { int data; struct list2 *next; };
    extern struct list2 *head;
    struct tree2  { struct stuff2 *data; struct tree2 *l, *r; };
    struct stuff2 { int start; int elem; };

Note that the tags list and list2 could use the same name. In fact they must use the same name: if we give them different names then the merged program will have conflicting declarations of the global head. Because of the extra start field in stuff2, however, the tags stuff and tree must have names different from stuff2 and tree2, respectively.
In this case, if we fail to rename the tree tag then the program will misbehave in a very strange way. Such situations do actually occur in practice (e.g., vortex and gcc among the SPECINT95 benchmarks [13]). Such renaming errors can be very hard to find in a large program. This motivated us to describe the problem and the merging algorithm precisely enough that we can argue that merging does not change the behavior of the program.

Our naming problem arises from the fact that C uses name equivalence for types, yet different compilation units are free to use different names even for types that are intended to interoperate with other units. In essence this means that the linked program cares only about structural type equivalence. Thus when we try to merge different modules together we have to go beyond name equivalence and use structural type equivalence. A similar problem occurs in distributed systems using remote-procedure call or remote storage, where different components might use different type names for types that are structurally equivalent and thus compatible [11]. This is in fact a common argument in favor of using structural type equivalence [1,2].

Our merging algorithm makes one pass over all the compilation units, incrementally accumulating a merged program. For each file there are two merging phases. In the first phase we merge the types and tags (since they do not depend on variable names). Then in the second phase we rewrite the variable declarations and function bodies.

In order to merge the types we first expand all of the typedef definitions. This is possible because in C the body of a typedef cannot refer to the name being defined or to type names not already defined. This leaves us with a set of tag definitions, which can be recursive as shown above. Without loss of generality we can model the tag definitions as follows:
    Tag definition   d ::= struct t { T1 ; T2 }
    Type             T ::= Int | Ptr(struct t)
Note that the Ptr constructor is always applied to a tag. The case where a pointer or array constructor would be applied to a base type is modeled as a base type, and the case where the constructor would be applied to another constructed type is itself treated as a constructor application. Given two sets of tag definitions, one from the already merged program M and one from the file being merged F, we must find which of the latter set of tags can share names with already defined tags. For the language of tag definitions considered above this is precisely structural type equivalence for recursive types. For each pair of tags struct t {T1; T2} from M and struct t′ {T1′; T2′} from F we scan the bodies of the definitions and we find either that they always match, or that they cannot possibly match under any renaming, or that they match provided some other tags are renamed to the same name. Notice that we consider only renaming of tags with other tags. Thus exactly one of the following two kinds of constraints will be generated for each pair of tag definitions (the second kind of constraint can have zero, one or two equalities on the right of the equivalence):
    t ≠ t′
    t = t′ ⟺ t1 = t1′ ∧ t2 = t2′

Once we decide on the names for the tags in a file we process the variable and function definitions. Among these declarations we can share only static and inline function definitions. We also remove duplicate global function prototypes and extern variable declarations. The whole implementation of the merger algorithm is about 600 lines of OCaml code.

We have tried the merger on various programs, the largest being those from the SPECINT95 and SPECINT00 benchmark suites. We have found it to work reasonably fast, with the biggest cost being that of saving the preprocessed source files instead of the object files. For example, to merge the sources of the gcc compiler on a 400 MHz Intel Pentium machine, it took 90 seconds to preprocess and save all of the sources, then 9 seconds to parse the preprocessed sources and another 9 seconds to merge them. gcc consists of 116 source and header files, totaling about 100,000 lines. The result of preprocessing them has two million tokens, while the result of the merging is a file with only 600,000 tokens (two-thirds of all tokens are shared between modules). We have found similar results for other programs.

As a side benefit of using the merger we have observed that both gcc and the Microsoft C compiler parse faster, and sometimes produce slightly faster executables, from the merged files, presumably due to an increased ability to optimize the program. However, the increased opportunity for inlining can also make the optimization phase substantially slower when full optimization is turned on.
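To make the constraint generation above concrete, the following OCaml fragment models it for the two-field tag language just defined. It is an illustrative model under our own type and function names (ty, tagdef, compare_defs), not the merger's actual source:

    (* model of a tag definition: struct t { T1 ; T2 } *)
    type ty = Int | Ptr of string                    (* Ptr names a tag *)
    type tagdef = { tag : string; f1 : ty; f2 : ty }

    type constr =
      | NoMatch                        (* t and t' can never share a name *)
      | MatchIf of (string * string) list
          (* t = t' iff each listed pair of tags also shares a name;
             the list has zero, one or two elements *)

    let compare_defs (dm : tagdef) (df : tagdef) : constr =
      let field a b = match a, b with
        | Int, Int -> Some []                        (* always matches *)
        | Ptr t1, Ptr t2 -> Some [ (t1, t2) ]        (* matches if the tags do *)
        | _ -> None                       (* Int vs Ptr: no renaming can help *)
      in
      match field dm.f1 df.f1, field dm.f2 df.f2 with
      | Some e1, Some e2 -> MatchIf (e1 @ e2)
      | _ -> NoMatch

    (* comparing tree from M with tree2 from F, as in the example above *)
    let _ = compare_defs
              { tag = "tree";  f1 = Ptr "stuff";  f2 = Ptr "tree"  }
              { tag = "tree2"; f1 = Ptr "stuff2"; f2 = Ptr "tree2" }
      (* = MatchIf [("stuff", "stuff2"); ("tree", "tree2")]; since stuff
         and stuff2 cannot match, tree2 must be renamed apart from tree *)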
9 Related Work
A variety of intermediate languages have been developed for use by compilers. Most of them are too low-level to extract recognizable source after transformation. Some intermediate representations have been designed specifically to aid high-level analyses, but they do not do sufficient elaboration of the source (as CIL does for lvalues, for example) to enable detailed and trouble-free analysis or transformation.

Microsoft's AST Toolkit [3] supports all of ANSI C and C++, along with Microsoft's extensions, but it does not support GNU extensions. It is tightly integrated with the MSVC compiler, works on any program the compiler works on, and offers hooks into various compilation stages. Its high-level program representation is harder to use in source-to-source transformations. For example, it does not provide expressions without side-effects as CIL does.

Ckit [9] is a C front end written in Standard ML. It uses abstract syntax trees and does not come with built-in support for control flow graphs. Although it does full ANSI C type checking, it does not annotate the code with explicit casts and type promotions.

Edison Design Group's front end [5] features a high-quality parser for the full ANSI C and C++ languages. Its emphasis is on thorough syntax analysis and error checking. It uses a high-level intermediate language, and it leaves the task of elucidating complicated C constructs to an appropriate back end. It also works on one source file at a time.
C-Breeze [10] is an infrastructure for building C compilers. It initially parses a program into an abstract syntax tree. Although it comes with a library of routines that can construct control flow graphs and carry out various analyses, these routines work on a much lower-level representation of the program, which is derived from the abstract syntax tree. No built-in support is provided for analyzing programs spanning several files.

The system that meets our requirements most closely is SUIF [14,8]. SUIF is an infrastructure for compiler research, consisting of an intermediate language, front ends for C and C++ (based on Edison's front end), and a library of routines to manipulate the intermediate representation. The intermediate language has an object-oriented design and supports program representation at various levels. The library includes transformers that can ensure most of the properties that are part of CIL's design. Although SUIF handles the full ANSI C language, it does not support many of the GCC extensions that appear in programs such as the Apache web server or the Linux kernel. For example, it cannot handle GNU-style assembly instructions or attributes. As a result, we have not been able to use SUIF to process large open-source projects like the Linux kernel or the SPECINT95 gcc benchmark.

In addition, compared to SUIF's C output, CIL's external representation is usually closer to the original source. In many cases (e.g., typedefs) SUIF does not retain user-supplied names, and it introduces many extraneous casts that can confuse certain kinds of analyses, such as CCured. For example, line 3 of the example on page 214 is emitted by SUIF as:

    (((((int *) (*(((int **) (((char *)
        &((((struct type_1 *) (str1))))[1]) + 0U)))))))[2]);

CIL output makes the memory accesses in this statement more apparent (as described in Section 2), and at the same time its output stays close to the source:

    *((str1 + 1)->fld + 2);

Finally, although SUIF comes with some support for merging multiple source files, in some cases it fails to do it correctly. For example, SUIF (version 2.2) does not correctly handle the example described in Section 8 (although some earlier versions appear to).
10 Conclusion and Future Work
The C programming language supports a number of features that make it attractive for systems programming. Unfortunately, many of these features are difficult to reason about. Even though there is abundant expertise in interpreting the constructs of the C programming language, there are very few tools that make program analysis, and especially source-to-source transformation, easy.
CIL is a minimal design that attempts both to distill the C language constructs into a small number of constructs with a precise interpretation, and to stay fairly close to the high-level structure of the code, so that the results of source-to-source transformations bear sufficient resemblance to the source code. We have used CIL successfully both for simple analyses and transformations and for a pervasive transformation that instruments C programs with code to ensure their memory safety. We thus believe that CIL indeed comes close to what we desire of an analysis and transformation infrastructure.

All of the CIL features came about in the context of one task or another that we used CIL for. It was surprisingly difficult in the beginning to handle lvalues and types correctly, with most of the difficulties being generated by the implicit conversions in C between an array and a pointer to its first element, and between a function and a pointer to it. We found that the most satisfactory solution to the first of these problems was to introduce the StartOf construct, which does not exist in C.

The only feature of CIL that we have not exercised as much as the others is the embedded control-flow graph. CCured includes a simple data flow analysis in support of array-bounds-check elimination, and we are starting to use CIL in yet another project in which data flow analysis will be preeminent. We expect that out of these experiences we will either gain more confidence in this part of the design or change it to better suit the needs of such analyses.

We have also found it extremely useful to have a whole-program merger that can act like a compiler and can be used transparently with make-based projects. Merging errors that manage to get past the compiler and linker can be a nightmare to find in a large program, so it is important to specify carefully how the merging algorithm works. We found that a restricted version of structural type equivalence for recursive types is both simple and sufficient for most purposes.

CIL currently handles all of ANSI C and almost all of the GCC and MSVC extensions. The exception is GCC's trampoline extension for nested functions, which we have yet to encounter in practice. The next step is to extend the system to handle C++. The source code for CIL and the associated tools is available at http://www.cs.berkeley.edu/~necula/cil.
Acknowledgments We wish to thank Aman Bhargava and Raymond To for help with the implementation of the CIL infrastructure, Mihai Budiu for assistance with the SUIF experiments, and the anonymous referees for their suggestions in improving this paper.
References

1. Roberto M. Amadio and Luca Cardelli. Subtyping recursive types. ACM Transactions on Programming Languages and Systems, 15(4):575-631, 1993.
2. Luca Cardelli, James Donahue, Mick Jordan, Bill Kalsow, and Greg Nelson. The Modula-3 type system. In Proceedings of the 16th Annual ACM Symposium on Principles of Programming Languages, pages 202-212, January 1989.
3. Microsoft Corporation. The AST Toolkit. http://research.microsoft.com/sbt/asttoolkit/ast.asp.
4. Crispin Cowan, Calton Pu, Dave Maier, Jonathan Walpole, Peat Bakke, Steve Beattie, Aaron Grier, Perry Wagle, Qian Zhang, and Heather Hinton. StackGuard: Automatic adaptive detection and prevention of buffer-overflow attacks. In Proceedings of the 7th USENIX Security Conference, pages 63-78, January 1998.
5. Edison Design Group. The C++ Front End. http://www.edg.com/cpp.html.
6. ISO/IEC. ISO/IEC 9899:1999(E) Programming Languages - C.
7. Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language (second edition). Prentice-Hall, Englewood Cliffs, N.J., 1988.
8. Holger Kienle and Urs Hölzle. Introduction to the SUIF 2.0 compiler system. Technical Report TRCS97-22, University of California, Santa Barbara, Computer Science Dept., December 10, 1997.
9. Bell Labs. ckit: A Front End for C in SML. http://cm.bell-labs.com/cm/cs/what/smlnj/doc/ckit/overview.html.
10. Calvin Lin, Samuel Guyer, Daniel Jimenez, and Teck Bok Tok. C-Breeze. http://www.cs.utexas.edu/users/c-breeze/.
11. Paul McJones and Andy Hisgen. The Topaz system: Distributed multiprocessor personal computing. In Proceedings of the IEEE Workshop on Workstation Operating Systems, November 1987.
12. George C. Necula, Scott McPeak, and Westley Weimer. CCured: Type-safe retrofitting of legacy code. In Proceedings of the 29th Annual ACM Symposium on Principles of Programming Languages, January 2002.
13. Standard Performance Evaluation Corporation. SPEC 95 Benchmarks. July 1995. http://www.spec.org/osg/cpu95/CINT95.
14. Robert Wilson, Robert French, Christopher Wilson, Saman Amarasinghe, Jennifer Anderson, Steve Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary Hall, Monica Lam, and John Hennessy. The SUIF compiler system: a parallelizing and optimizing research compiler. Technical Report CSL-TR-94-620, Stanford University, Computer Systems Laboratory, May 1994.
Linear Scan Register Allocation in the Context of SSA Form and Register Constraints*

Hanspeter Mössenböck and Michael Pfeiffer

University of Linz, Institute of Practical Computer Science
{moessenboeck,pfeiffer}@ssw.uni-linz.ac.at

* This work was supported by Sun Microsystems, California.
Abstract. Linear scan register allocation is an efficient alternative to the widely used graph coloring approach. We show how this algorithm can be applied to register-constrained architectures like the Intel x86. Our allocator relies on static single assignment form, which simplifies data flow analysis and tends to produce short live intervals. It makes use of lifetime holes and instruction weights to improve the quality of the allocation. Our measurements confirm that linear scan is several times faster than graph coloring for medium-sized to large programs.
1 Introduction
Register allocation is the task of assigning registers to the variables and temporaries of a program. It is crucial for the efficiency of the compiled code. The standard algorithm for register allocation is based on graph coloring [4, 3]: it builds an interference graph, in which the nodes represent the values in a program. An edge is drawn between two values if they are live at the same time. The graph is then colored such that adjacent nodes get different colors. If colors are viewed as registers, we get a register allocation in which two values are kept in different registers if they are live at the same time.

There are situations, however, in which graph coloring is too slow, for example in a just-in-time (JIT) compiler that translates an intermediate program representation to machine code at load time or even at run time. JIT compilers must do their job in almost no time but should still produce high-quality code. This conflict has led to a new register allocation technique called linear scan [10,11,13]. It assigns registers to values in a single linear scan over the live intervals of all values in a program. The live interval of a value v is the range of instructions starting at the defining instruction and ending at the instruction where v is used for the last time. If the live intervals of two values overlap, the values cannot reside in the same register. Although graph coloring leads to a slightly better register allocation than linear scan, the latter runs several times faster and is therefore an attractive register allocation technique for JIT compilers.

This paper describes an implementation of the linear scan register allocation technique, making two contributions.
Firstly, and in contrast to [10,11,13,8], we base our allocator on programs in static single assignment form (SSA form). This simplifies data flow analysis and tends to produce shorter live intervals, but requires modifications to the original linear scan algorithm. Secondly, we show how linear scan can be applied to register-constrained architectures such as the Intel x86. While [10,11,13,8] describe the algorithm for RISC architectures, a CISC machine like the Intel x86 requires modifications to the basic algorithm because of its two-address instructions and the fact that some operations expect or deliver values in specific registers.

The work described in this paper was done in a joint project with Sun Microsystems, in which their Java HotSpot™ client compiler [7] was extended with SSA form, register allocation and various optimizations. The HotSpot client compiler is a JIT compiler that is invoked for frequently called methods. Our modified compiler builds a control flow graph from the bytecodes of the method, translates the bytecodes to intermediate instructions of a register machine, brings them into SSA form (eliminating loads and stores for local variables), performs global common subexpression elimination and register allocation, and finally generates code for the Intel x86. The first version of our compiler used a graph coloring register allocator. Since this was not fast enough, we reimplemented the allocator using the linear scan technique.

Section 2 of this paper describes the original linear scan algorithm, both in its simple form and in a refined form in which lifetime holes are exploited by filling them with other live intervals. We also explain how SSA form affects the computation of live intervals. Section 3 explains the data structures on which our algorithm relies and Section 4 describes how the intermediate code is prepared for register allocation. In Section 5 we explain our linear scan technique taking the peculiarities of the Intel architecture into account. Section 6 evaluates the complexity of our algorithm, compares it with related approaches and shows some measurements. Finally, Section 7 summarizes the results.
2 Linear Scan Register Allocation
Linear scan was introduced by Poletto et al. [10, 11] as an alternative to graph coloring allocation. It computes the live intervals of the values in a program and scans them sequentially to find overlaps. Non-overlapping intervals can be assigned the same register. Since the live interval of a value v may contain holes in which v is not live, a refined version of this algorithm (called second-chance binpacking) was described by Traub et al. [13]. Although more complicated, this algorithm results in a better usage of registers. It also splits live intervals so that a value may reside in different registers during its lifetime. Neither algorithm, however, takes into account that many optimizing compilers keep the intermediate program representation in SSA form. Therefore Section 2.3 describes how SSA form affects the linear scan allocation technique.

2.1 Basic Algorithm
The live interval of a value v is the range of instruction numbers [i, j[ such that i is the instruction where v starts to live and j is the instruction where it ends living.
The value v may still be used at j, but it does not interfere with another value defined at j. Thus the interval is open on the right-hand side. The instructions are numbered consecutively through all basic blocks in a topological order of the control flow graph without backward edges. The live variable information is obtained by data-flow analysis [1]. Fig. 1 shows an example of four live intervals computed from a linear sequence of instructions.
    1: a = ...
    2: b = ...
    3: c = b + 1
    4: d = a + c
    5: ... = c
    6: ... = d
    7: ... = b

    Live intervals:  a: [1,4[   b: [2,7[   c: [3,5[   d: [4,6[

Fig. 1. A simple instruction sequence and its live intervals
The linear scan algorithm traverses all intervals in the order of increasing start points, maintaining a list, active, which contains those intervals that overlap the start point of the current interval. Initially all registers are free. For every interval i the algorithm performs the following steps:

• If there are live intervals j in active that already expired before i begins (i.e., j.end ≤ i.beg), remove them from active and add j.reg to the set of free registers.
• If there are still free registers, assign one of them to i and add i to active. If there are no free registers, spill the interval with the largest end point among i and all intervals in active. If an interval from active was spilled, assign its register to i, and add i to active.
Assuming that we have 2 registers, r1 and r2, the algorithm processes the intervals of Fig. 1 as follows:

    interval  action                                         free    active
              initialize                                     r1, r2  -
    a         assign r1 to a; make a active                  r2      a(r1)
    b         assign r2 to b; make b active                  -       a(r1), b(r2)
    c         spill b since it ends after c; make r2 free    r2      a(r1)
              assign r2 to c; make c active                  -       a(r1), c(r2)
    d         remove a from active (expired); make r1 free   r1      c(r2)
              assign r1 to d; make d active                  -       c(r2), d(r1)
In this example, a and d end up in r1 and c in r2. The value b was first assigned to a register, but later it was spilled and thus resides in memory.
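To make the basic algorithm concrete, here is a small self-contained OCaml sketch of it (ignoring lifetime holes and fixed registers), replaying the example above. The record type and helper names are our own choices for the sketch, not the implementation described later in this paper; intervals are assumed to be sorted by increasing start point:

    type interval = {
      name : string;
      beg : int;
      end_ : int;                      (* the interval is [beg, end_[ *)
      mutable reg : int option;        (* assigned register, if any *)
    }

    let linear_scan (intervals : interval list) (num_regs : int) =
      let free = ref (List.init num_regs (fun r -> r)) in
      let active = ref [] in
      let by_end a b = compare a.end_ b.end_ in
      List.iter (fun cur ->
          (* expire intervals that end before cur begins, freeing registers *)
          let expired, live =
            List.partition (fun j -> j.end_ <= cur.beg) !active in
          List.iter (fun j -> match j.reg with
              | Some r -> free := r :: !free
              | None -> ()) expired;
          active := live;
          match !free with
          | r :: rest ->                     (* a register is available *)
              cur.reg <- Some r;
              free := rest;
              active := List.sort by_end (cur :: !active)
          | [] ->                            (* spill the interval ending last *)
              let last = List.fold_left
                  (fun a b -> if b.end_ > a.end_ then b else a) cur !active in
              if last != cur then begin
                cur.reg <- last.reg;         (* steal the register *)
                last.reg <- None;            (* the spilled value lives in memory *)
                active := List.sort by_end
                    (cur :: List.filter (fun j -> j != last) !active)
              end)
        intervals

    (* the intervals of Fig. 1 with two registers r1 and r2 (0 and 1) *)
    let () =
      let mk n b e = { name = n; beg = b; end_ = e; reg = None } in
      let ivs = [ mk "a" 1 4; mk "b" 2 7; mk "c" 3 5; mk "d" 4 6 ] in
      linear_scan ivs 2;
      List.iter (fun i ->
          print_endline (i.name ^ ": " ^
            (match i.reg with
             | Some r -> "r" ^ string_of_int (r + 1)
             | None -> "memory"))) ivs

Running this reproduces the trace above: a and d share r1, c gets r2, and b ends up in memory.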
2.2 Holes in Live Intervals
Between the first definition and the last use of a value there may be points at which the value is not live. Consider for example the program in Fig. 2 in which we look at the live intervals of the variables a and b.
     1: a = ...
     2: ...
     3: ... = a
     4: ...
     5: a = ...
     6: ...
     7: b = ...
     8: ...
     9: ... = b
    10: ...
    11: ... = a
    12: ...
    13: ...

    Live intervals without holes:  a: [1,11[                 b: [7,9[
    Live intervals with holes:     a: [1,3[, [5,7[, [9,11[   b: [7,9[

Fig. 2. Holes in live intervals
The live interval of a has two holes: the first between instructions 3 and 5, where a is not used any more before it is redefined, and the second between instructions 7 and 9, resulting from the order in which we numbered the instructions. Since the interval of b falls exactly into such a hole, it can be assigned the same register as the interval of a. Keeping track of holes in live intervals makes the linear scan algorithm more complicated, but it pays off since we get more values into registers. The refinement of linear scan with lifetime holes was described by Traub et al. [13]. The idea is also used in our algorithm, which we will describe in Section 5.

Traub et al. add a second improvement to the linear scan algorithm. If an interval is assigned a register but gets spilled later, a spill instruction is inserted at that point and the interval is split into two halves. In the first half the value resides in a register; in the second half it resides in memory unless it is selected for being reloaded into a register later. They call their algorithm second-chance binpacking because a spilled value gets a second chance to reside in a register later. We did not use this idea in our algorithm, because our live intervals tend to be shorter due to SSA form, as we will describe in Section 2.3.

In Traub's algorithm, the decision which interval to spill when the allocator runs short of registers is based on weights computed from the distance to the next use of a value and the nesting level. We use similar weights based on the number of accesses to the value and the nesting level.
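As a small illustration of such access-based weights (the exact formula is not given here; the base 10 per nesting level below is our own assumption for the sketch), one could compute in OCaml:

    (* weight of a value from the loop-nesting depth of each access *)
    let weight (access_depths : int list) : float =
      List.fold_left
        (fun w depth -> w +. (10.0 ** float_of_int depth))
        0.0 access_depths

    (* two top-level accesses plus one access in a doubly nested loop *)
    let _ = weight [0; 0; 2]      (* 1.0 +. 1.0 +. 100.0 = 102.0 *)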
2.3 Live Intervals and Static Single Assignment Form
Many optimizing compilers keep the intermediate program in Static Single Assignment Form (SSA form) [6, 9] because it simplifies data flow analysis and optimizations. In SSA form, every assignment introduces a new and uniquely named variable so that there is never more than one assignment statement per variable. Thus, given a variable name one immediately knows where this variable received a value. If two variables have the same name they must also have the same value. Fig. 3 shows a statement sequence and its transformation to SSA form.
    original:          SSA form:
    a = ...            a1 = ...
    b = a + 1          b1 = a1 + 1
    a = ...            a2 = ...
    b = b + a          b2 = b1 + a2

Fig. 3. A statement sequence and its SSA form
If there are multiple assignments to a variable we get several smaller live intervals, one for every copy of this variable, instead of a single large interval for the original variable. Each of these intervals can reside in a different register, decreasing register pressure. Thus, the need for splitting intervals as it is done in second-chance binpacking is not as important as without SSA form. If two values of a variable flow together at a basic block, SSA form requires the insertion of a so-called φ-function (or phi-function), which is a pseudo instruction that creates yet another copy of this variable. This is shown in Fig. 4.

    B1:               B2:
    1: a1 = ...       4: a2 = ...
    2: b1 = ...       5: b2 = ...
    3: ...            6: ...

         B3:
         7: a3 = φ(a1, a2)
         8: b3 = φ(b1, b2)
         9: ... = b3

Fig. 4. φ-functions in a merge block
The φ-function in instruction 8 means that if the control flow comes via the left branch b3 becomes b1, otherwise b3 becomes b2. It creates a single definition point for the value of b that flows from here and is used in instruction 9. Unfortunately, φ-functions become a problem in the computation of live intervals. For example, the live interval of b1 is [2,4[, [7,8[ and the live interval of b2 is [5,7[, [7,8[. This would lead to an overlap of the two intervals in instruction 7 forcing them into different registers. However, this is exactly what we do not want, since b1 and b2 are two values of the same variable and should end up in the same register if possible so that the φ-function in instruction 8 can be eliminated and the same register can be used for b1, b2 and b3. In fact, b1 and b2 are not live at the same time in instruction 7. b1 is only live if we come via the left branch and b2 is only live if we come via the right branch. If we could insert move instructions at the end of B1 and B2 and eliminate the φ-functions in B3 the overlap would be removed (Fig. 5a). However, this would invalidate SSA form. The solution is to insert move instructions while keeping the φ-functions, and to treat φ-functions as special cases for liveness analysis (Fig. 5b).
a)
    B1:               B2:
    1: a1 = ...        6: a2 = ...
    2: b1 = ...        7: b2 = ...
    3: ...             8: ...
    4: a3 = a1         9: a3 = a2
    5: b3 = b1        10: b3 = b2

         B3:
         13: ... = b3

b)
    B1:               B2:
    1: a1 = ...        6: a2 = ...
    2: b1 = ...        7: b2 = ...
    3: ...             8: ...
    4: a3 = a1         9: a3 = a2
    5: b4 = b1        10: b5 = b2

         B3:
         13: b3 = φ(b4, b5)
         14: ... = b3

Fig. 5. Move instructions are inserted for the operands of φ-functions
In Fig. 5b the live interval of b1 is [2,5[, the live interval of b2 is [7,10[, and the live interval of b3 is [14,14[ (φ-functions are excluded from live intervals as described in Section 4.3). There is no overlap any more and b1, b2 and b3 can be put into the same register. By coalescing (Section 4.4) we can possibly also eliminate instructions 5 and 10. If only b1 and b3 can be put into the same register but not b2 (e.g., because this register is used for some other purpose in B2) instruction 10 remains a register move.
3 Data Structures
The data structures for basic blocks are as described in Fig. 6. Every block has pointers to its successors and predecessors as well as a pointer to its first and last instruction and to the first φ-function (φ-functions precede the ordinary instructions).
    b.phi    (points to the first φ-function)
    b.first  (points to the first ordinary instruction)
    b.last   (points to the last instruction)

Fig. 6. Data structures for basic blocks
Every instruction i has an instruction number i.n and a field i.reg that holds the register that the allocator assigns to the value created by i. The reg fields are initialized to -1 (any), meaning that no register has been assigned so far. If an instruction i should produce a value in a specific register r (as is sometimes the case on Intel processors), i.reg is initialized to r (r ≥ 0) and the register allocator does not overwrite this value. This technique is sometimes called precoloring and is described in more detail, for example, in [5].

When the bytecodes are transformed to instructions of the intermediate representation (IR), we eliminate stores and loads for local variables (except for loads of parameters). Every instruction produces a value that is stored in a new virtual register, assuming that we have an unlimited number of virtual registers.
Fig. 7 shows an example of a Java function and the IR instructions generated for it.
    int f(int a) {          1: i1 = load a
      int b = a * a;        2: i2 = i1 * i1
      return b + a;         3: i3 = i2 + i1
    }                       4: ret i3

Fig. 7. Every instruction produces a value in a virtual register
Instructions 1, 2 and 3 produce a value in a new virtual register (i1, i2, i3); thus the IR is in SSA form. The reg fields of these instructions are initialized to any; the register allocator will assign physical registers to them later. Instruction 4 does not produce a value. Nevertheless it has a reg field, which the register allocator ignores. Stores and loads of the variable b have been eliminated.

Live intervals are stored as a sorted sequence of sub-intervals (ranges) that are open on the right-hand side. For example, the interval [3,5[, [10,15[, [18,20[ consists of three ranges. The first range starts at instruction 3, where the value is live, and ends at instruction 5, where the value may still be used but is no longer live once a new value is defined there. All live intervals are kept in an array interval (see Fig. 8). The live interval of a value defined in instruction i can be found in interval[i.n].

    interval[1] → ranges: [beg,end[
    interval[2] → ranges: [beg,end[, [beg,end[
    interval[3] → ranges: [beg,end[, [beg,end[, [beg,end[
    ...

Fig. 8. Live intervals and their ranges
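A minimal OCaml sketch of this representation (with our own names) shows how ranges are added and how interval queries work; merging adjacent ranges here anticipates ADDRANGE of Section 4.3, and covers corresponds to the active/inactive test of Section 5:

    type range = { beg : int; end_ : int }  (* the right-open range [beg, end_[ *)
    type itv = range list                   (* ranges sorted by increasing beg *)

    (* prepend a range, merging it with the old head when the two touch or
       overlap; ranges arrive in decreasing order of start points because
       BuildIntervals walks each block backwards *)
    let add_range (r : range) (iv : itv) : itv =
      match iv with
      | first :: rest when r.end_ >= first.beg ->
          { beg = min r.beg first.beg; end_ = max r.end_ first.end_ } :: rest
      | _ -> r :: iv

    (* does the interval cover position pos? *)
    let covers (iv : itv) (pos : int) : bool =
      List.exists (fun r -> r.beg <= pos && pos < r.end_) iv

    (* do two intervals overlap anywhere? *)
    let rec overlaps (a : itv) (b : itv) : bool =
      match a, b with
      | [], _ | _, [] -> false
      | ra :: ta, rb :: tb ->
          if ra.end_ <= rb.beg then overlaps ta b
          else if rb.end_ <= ra.beg then overlaps a tb
          else true

    (* [1,3[ added in front of [3,7[ collapses into the single range [1,7[ *)
    let _ = add_range { beg = 1; end_ = 3 } [ { beg = 3; end_ = 7 } ]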
Note that the array interval is automatically sorted in the order of increasing start points of the live intervals, since every instruction (except return, goto, etc.) creates a new value and is the start of this value's live interval.

We say that the live interval of a value v is fixed if v.reg ≥ 0 prior to register allocation. Fixed intervals with the same register are joined (see Section 4.4) into a single interval. In order to make sure that fixed intervals of the same register do not overlap, we insert moves before or after the instructions that generate or use values in fixed registers. If an instruction

    x = y op z

requires y to be in a specific register r, we insert a move instruction in front of it

    u = y
    x = u op z

and set u.reg to r. If the instruction leaves its result x in a specific register r, we insert a move instruction after it
    v = y op z
    x = v

and set v.reg to r. The moves make sure that fixed intervals of the same register do not overlap. Many of these moves can be eliminated by coalescing (see Section 4.4).

Finally, we use live sets that we obtain by live variable analysis [1] and store them as bit sets. Live variable analysis is considerably simplified by SSA form, as described for example in [9]. Every basic block b stores in b.live the set of values that are live immediately before the instruction b.first.
4 Preparing the IR for Linear Scan

4.1 Generating Moves for φ-Operands
As explained in Section 2, we have to generate moves for the operands of φ-functions. Fig. 9 shows the result of this process and the algorithm GENMOVES() explains the details. Since there is no block in the original graph to hold instruction 6, we have to insert one.

    Source:           Before:                   After:
    a = ...           B1: 1: i1 = ...           B1: 1: i1 = ...
    ...                   2: i2 = ...               2: i2 = ...
    b = ...           B2: 3: i3 = ...           B2: 3: i3 = ...
    a = b + 1             4: i4 = i3 + 1            4: i4 = i3 + 1
    ... = a                                         5: i5 = i4
                                                new: 6: i6 = i1
                      B3: 7: i7 = φ(i4, i1)    B3: 7: i7 = φ(i5, i6)
                          8: i8 = ... i7           8: i8 = ... i7

Fig. 9. Move instructions 5 and 6 are generated for the φ-function 7
GenMoves()
  for all blocks b do
    for all predecessors p of b do
      if b.no_of_predecessors > 1 and p.no_of_successors > 1 then
        insert a new block n between p and b
      else
        n ← p
      for each φ-function phi of b do
        i ← new RegMove(phi.opd(p))    // the φ-operand corresponding to p
        phi.opd(p) ← i
        append i to n
        join i with phi                // see Section 4.4
4.2 Numbering the Instructions
After moves have been inserted for φ-operands the instructions have to be numbered consecutively. In order to do that we traverse all basic blocks in topological order so that a block b is only visited after all its predecessors that have forward branches to b have been visited. Fig. 10 shows some valid visit sequences.
[Figure: four small control flow graphs, each annotated with one valid visit order of its blocks]

Fig. 10. Valid visit sequences of blocks for instruction numbering
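A possible OCaml sketch of such a numbering traversal uses Kahn's topological sort: a block becomes ready once all its forward predecessors have been visited. The block type below is an assumption of the sketch, and we assume that back edges have already been excluded from fwd_succs and from the preds_left counts:

    type block = {
      bid : int;
      fwd_succs : block list;      (* successors via forward edges only *)
      mutable preds_left : int;    (* not-yet-visited forward predecessors *)
    }

    let visit_order (blocks : block list) : block list =
      let ready = Queue.create () in
      let order = ref [] in
      List.iter (fun b -> if b.preds_left = 0 then Queue.add b ready) blocks;
      while not (Queue.is_empty ready) do
        let b = Queue.pop ready in
        order := b :: !order;      (* number b's instructions at this point *)
        List.iter (fun s ->
            s.preds_left <- s.preds_left - 1;
            if s.preds_left = 0 then Queue.add s ready)
          b.fwd_succs
      done;
      List.rev !order

Any order produced this way is one of the valid visit sequences of Fig. 10.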
4.3 Computing Live Intervals
In SSA form there is only one assignment to every variable. This assignment marks the beginning of the variable's lifetime. The variable lives on all paths from its definition to its last use. For every block b and every variable v we compute a range rv,b that denotes the live range of v in b, as shown in Fig. 11. If v is live at the end of b, it must have been defined either in b or in some predecessor block p. If v was defined in p, then rv,b begins at b.first and ends after b.last. If it was defined in b, then rv,b begins at the instruction v and ends after b.last. If v is not live at the end of b but is used in b, then rv,b begins as described above and ends at the last use of v in b. The last use of a variable is detected using the live sets: the instructions of b are traversed in reverse order; if a variable v is used at instruction i but is not in the live set at the end of i, then i is the last use of v.

    v defined in      v live at end of b?     rv,b
    a predecessor     yes                     [b.first, b.last+1[
    b (at v)          yes                     [v, b.last+1[
    a predecessor     no, last use at i       [b.first, i[
    b (at v)          no, last use at i       [v, i[

Fig. 11. Computation of the live range rv,b of a variable v in block b
The live interval of a φ-function i in block b does not start at i but at the first ordinary instruction in this block (b.first). This avoids undesired conflicts between the φ-functions of a block. It is an invariant of our algorithm that the defining instruction of a φ-function never appears in a live interval. The algorithm ADDRANGE(i, b, end) computes the range ri,b of instruction i in block b (according to Fig. 11), assuming that we already know that i ends living at the instruction with the number end. It then adds the range to the live interval of i.
ADDRANGE(i: Instruction; b: Block; end: integer)
  if b.first.n ≤ i.n ≤ b.last.n then
    range ← [i.n, end[
  else
    range ← [b.first.n, end[
  add range to interval[i.n]          // merging adjacent ranges

If possible, adjacent ranges of the same live interval are merged. For example, the ranges [1,3[, [3,7[ are merged into a single range [1,7[. The algorithm BUILDINTERVALS() traverses the control flow graph in an arbitrary order, finds out which values are live at the end of every block, and computes the ranges for these values as described above.

BuildIntervals()
  for each block b do
    live ← {}
    for each successor s of b do
      live ← live ∪ s.live
      for each φ-function phi in s do
        live ← live – {phi} ∪ {phi.opd(b)}
    for each instruction i in live do
      ADDRANGE(i, b, b.last.n+1)
    for all instructions i in b in reverse order do
      live ← live – {i}
      for each operand opd of i do
        if opd ∉ live then
          live ← live ∪ {opd}
          ADDRANGE(opd, b, i.n)
Fig. 12 shows a sample program in source code and in intermediate representation, with a φ-function for the value d and corresponding move instructions in the predecessor blocks. Fig. 13 shows the live intervals that are computed for this program by BUILDINTERVALS(). Note that the live intervals of i2 and i11 exclude instruction 11, since φ-functions never appear in live intervals.

    Source:               IR:
    a = ...               B1:  1: i1 = ...
    b = ...                    2: i2 = ...

    (one branch)          B2:  3: i3 = ... i1
    ... = a                    4: i4 = ... i2
    c = b                      5: i5 = ...
    d = ...                    6: i6 = ... i4
    ... = c                    7: i7 = i5

    (other branch)        B3:  8: i8 = ...
    d = ...                    9: i9 = ... i1
    ... = a                   10: i10 = i8

    (merge)               B4: 11: i11 = φ(i7, i10)
    e = ...                   12: i12 = ...
    ... = d                   13: i13 = i2 + i11
    ... = e                   14: i14 = ... i12

Fig. 12. Sample program in source code and in intermediate representation
    i1:  [1,3[, [8,9[
    i2:  [2,11[, [12,13[
    i4:  [4,6[
    i5:  [5,7[
    i7:  [7,8[
    i8:  [8,10[
    i10: [10,11[
    i11: [12,13[
    i12: [12,14[

Fig. 13. Live intervals computed from the program in Fig. 12
4.4 Joining Values
Sometimes we want two values to go into the same register, for example:

• a φ-function and its operands (so that the φ-function can be eliminated);
• the left-hand and right-hand sides of register moves (so that the move can be eliminated);
• the first operand y and the result x of a two-address instruction x = y op z, as required by the Intel x86 architecture.

If the live intervals of the two values do not overlap we can join them, i.e., we merge their intervals so that the register allocator assigns the same register to them. This is also called coalescing [2]. Note that coalescing leads to longer intervals, possibly introducing additional conflicts that force more values into memory. Currently we do not try to minimize such conflicts, although it could be done as described for example in [2].

A group of joined values is represented by only one of those values, its representative, using a union-find algorithm [12]. Every instruction i has a field i.join, which points to its representative. Initially, i.join = i for all instructions i. If we have three values, a, b, and c, and if we join b with c, and then a with b, we get a group with c as its representative, as shown in Fig. 14.
[Figure: joined values a and b whose join pointers lead to the representative c]

Fig. 14. A group of four joined values with c as its representative
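The representative scheme of Fig. 14 is ordinary union-find; a minimal OCaml sketch (with illustrative names, not the compiler's actual data structures) looks as follows and mirrors the JOIN/REP algorithms given below:

    type value = { id : int; mutable join : value }

    let make id = let rec v = { id; join = v } in v  (* initially v.join = v *)

    let rec rep v = if v.join == v then v else rep v.join

    (* join x's group with y's group; y's representative wins, just as
       JOIN below sets x.join to REP(y) *)
    let join_values x y =
      let rx = rep x and ry = rep y in
      if rx != ry then rx.join <- ry

    let () =
      let a = make 1 and b = make 2 and c = make 3 in
      join_values b c;             (* group {b, c}, representative c *)
      join_values a b;             (* group {a, b, c}, representative c *)
      assert (rep a == c && rep b == c)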
Taking into account that certain values have to be in specific registers, we can join two values x and y only if they are compatible, i.e., if

• both do not have to be in specific registers, or
• both have to be in the same specific register, or
• x must be in a specific register and the interval of y does not overlap any other interval to which x.reg has been assigned (or vice versa). More formally:

    x.reg ≥ 0 ∧ ¬(∃ interval iv: iv.reg = x.reg ∧ interval[y.n] overlaps iv), or
    y.reg ≥ 0 ∧ ¬(∃ interval iv: iv.reg = y.reg ∧ interval[x.n] overlaps iv)
The algorithm JOIN(x, y) joins the two values x and y if they are compatible:

JOIN(x, y: Instruction)
  i ← interval[REP(x).n]
  j ← interval[REP(y).n]
  if i ∩ j = {} and x and y are compatible then
    interval[REP(y).n] ← i ∪ j
    drop interval[REP(x).n]
    x.join ← REP(y)

REP(x: Instruction): Instruction
  if x.join = x then return x
  else return REP(x.join)

If we look at the program in Fig. 12, we can join the values 11, 7 and 10 (the φ-function and its operands) as well as 5 with 7 and 8 with 10 (the left- and right-hand sides of the register moves). The resulting intervals are shown in Fig. 15. The live intervals are now in a form that can be used for linear scan register allocation. This will be described in the next section.

    i1:            [1,3[, [8,9[
    i2:            [2,11[, [12,13[
    i4:            [4,6[
    i5,7,8,10,11:  [5,11[, [12,13[
    i12:           [12,14[

Fig. 15. Live intervals of Fig. 13 after join operations
5 The Linear Scan Algorithm
The register allocator has to map an unbounded number of virtual registers to a small set of physical registers. If a value cannot be mapped to a register, it is assigned to a memory location. Many instructions of the Intel x86 allow memory operands, so there is a good chance that such a value never has to be loaded into a register. If it has to be in a register, however, we load it into a scratch register (one scratch register is excluded from register allocation). If an instruction needs more than one scratch register, the code generator spills one of the registers and uses it as a temporary scratch register. When the spilled value is needed again, the code generator reloads it into the same register as before. Note that spill instructions are emitted by the code generator and not by the register allocator, which only decides whether a value should reside in a register or in memory.

The register allocator assumes that all live intervals of a method are sorted in the order of increasing start points. It makes the first interval the current interval (cur) and divides the remaining intervals into the following four sets:

• unhandled set: all intervals that start after cur.beg;
• handled set: all intervals that ended before cur.beg or were spilled (see below);
• active set: all intervals where one of their ranges overlaps cur.beg;
• inactive set: all intervals where cur.beg falls into one of their holes.
Throughout register allocation the following invariants hold: registers assigned to intervals in the handled set are free; registers assigned to intervals in the active set are not free; a register assigned to an interval i in the inactive set is either free or occupied by a currently active interval j that does not overlap i (i.e., j lies fully in a hole of i). When i becomes active again, j has already ended, so that i can reclaim its register. The algorithm LINEARSCAN() repeatedly picks the first interval cur from unhandled, updating the sets active, inactive and handled appropriately.

LinearScan()
  unhandled ← all intervals in increasing order of their start points
  active ← {}; inactive ← {}; handled ← {}
  free ← set of available registers
  while unhandled ≠ {} do
    cur ← pick and remove the first interval from unhandled
    //----- check for active intervals that expired
    for each interval i in active do
      if i ends before cur.beg then
        move i to handled and add i.reg to free
      else if i does not overlap cur.beg then
        move i to inactive and add i.reg to free
    //----- check for inactive intervals that expired or become reactivated
    for each interval i in inactive do
      if i ends before cur.beg then
        move i to handled
      else if i overlaps cur.beg then
        move i to active and remove i.reg from free
    //----- collect available registers in f
    f ← free
    for each interval i in inactive that overlaps cur do
      f ← f – {i.reg}
    for each fixed interval i in unhandled that overlaps cur do
      f ← f – {i.reg}
    //----- select a register from f
    if f = {} then
      ASSIGNMEMLOC(cur)               // see below
    else
      if cur.reg < 0 then
        cur.reg ← any register in f
      free ← free – {cur.reg}
      move cur to active
If we cannot find a free register for cur, we assign a memory location to either cur or to one of the other currently active or inactive intervals, whichever has the lowest weight. The weights are computed from the accesses to the intervals, weighted by the nesting level in which the accesses occur. Here is the algorithm:
ASSIGNMEMLOC(cur: Interval)
  for all registers r do
    w[r] ← 0                          // clear register weights
  for all intervals i in active, inactive and (fixed) unhandled do
    if i overlaps cur then
      w[i.reg] ← w[i.reg] + i.weight  // if i is fixed, i.weight = ∞
  find r such that w[r] is a minimum
  if cur.weight < w[r] then
    assign a memory location to cur and move cur to handled
  else
    // assign memory locations to the intervals to which r was assigned
    move all active or inactive intervals to which r was assigned to handled
    assign memory locations to them
    cur.reg ← r
    move cur to active
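The register choice inside ASSIGNMEMLOC can be rendered compactly in OCaml; the data layout below (a list of (register, weight) pairs for the conflicting intervals, with infinity for fixed ones) is our own simplification for the sketch:

    let choose_register (num_regs : int)
        (conflicts : (int * float) list) : int * float =
      (* conflicts: (i.reg, i.weight) for every active, inactive or fixed
         unhandled interval that overlaps cur *)
      let w = Array.make num_regs 0.0 in
      List.iter (fun (r, wt) -> w.(r) <- w.(r) +. wt) conflicts;
      let best = ref 0 in
      Array.iteri (fun r _ -> if w.(r) < w.(!best) then best := r) w;
      (!best, w.(!best))

    (* the situation at cur = interval 5 in Table 1 below: r1 (register 0)
       holds intervals 1 and 4 (weights 3 and 2), r2 (register 1) holds
       interval 2 (weight 3) *)
    let _ = choose_register 2 [ (0, 3.0); (0, 2.0); (1, 3.0) ]
      (* = (1, 3.0): r2 is cheapest; since cur's weight 7 is not below 3,
         interval 2 is moved to memory and cur receives r2 *)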
Table 1 shows how LINEARSCAN() works through the intervals of Fig. 15, assuming that we have 2 registers available. The weights of the intervals can be computed from the accesses to the values (see Fig. 12) and are as follows: i1: 3, i2: 3, i4: 2, i5: 7, i12: 2 (accesses in a φ-function are neglected).

Table 1. Simulation of LINEARSCAN() for the intervals of Fig. 15

    cur  action                          free    unhandled       active      inactive  handled
         initialize                      r1, r2  1, 2, 4, 5, 12  -           -         -
    1    assign r1 to interval 1         r2      2, 4, 5, 12     1r1         -         -
    2    assign r2 to interval 2         -       4, 5, 12        1r1, 2r2    -         -
    4    move interval 1 to inactive     r1      5, 12           2r2         1r1       -
         assign r1 to interval 4         -       5, 12           2r2, 4r1    1r1       -
    5    put interval 2 into memory      r2      12              4r1         1r1       2m
         assign r2 to interval 5         -       12              4r1, 5r2    1r1       2m
    12   move int. 1 and 4 to handled    r1      -               5r2         -         1r1, 2m, 4r1
         assign r1 to interval 12        -       -               5r2, 12r1   -         1r1, 2m, 4r1
Interval 2 was put into memory because its weight (3) is less than the cumulated weights of intervals 1 and 4, which occupy the same register at that time (weight = 5), and of the current interval 5 (weight = 7). Fig. 16 shows the result of the register allocation for Fig. 15.

    i1:  r1
    i2:  memory
    i4:  r1
    i5:  r2
    i12: r1

Fig. 16. Result of the register allocation with 2 available registers
6 Evaluation

6.1 Complexity
LINEARSCAN takes linear time to scan the intervals. For every interval it has to inspect the active, inactive and unhandled fixed sets in order to find overlaps. Since there cannot be more active intervals than registers, the length of the active set is bounded by the number of registers, which is a small constant. The length of the inactive set can come close to the total number of intervals, which would lead to a quadratic time complexity in the worst case. In practice, however, there are only very few inactive intervals (typically fewer than 2) at any point in time, so the behavior is still linear. Finally, the number of unhandled fixed intervals is bounded by the number of available registers, because fixed intervals with the same register are joined into a single interval. Therefore, if n is the number of live intervals, the overall complexity of our algorithm is O(n²) in the worst case but linear in practice.

During preprocessing we have to generate moves for φ-functions. This takes time proportional to the number of φ-functions, which is smaller than n. Live intervals are generated in sorted order, so we do not need a separate pass to sort them.
6.2 Comparison with Related Work
The novelty of our approach lies in the fact that it is applicable to programs in SSA form and that it can deal with values that have to reside in specific registers. The adaptations for SSA form are done in a preprocessing step in which moves are inserted into the instruction stream in order to neutralize the φ-functions. After this step, SSA form does not affect the linear scan register allocation, since φ-functions no longer show up in the live intervals.

In contrast to Poletto and Sarkar [11], our linear scan algorithm can deal with lifetime holes and fixed intervals, which makes it more complicated: in addition to the three sets unhandled, handled and active we need a fourth set, inactive, to hold intervals with a hole into which the start of the current interval falls. We also have to exclude registers that are occupied by overlapping fixed intervals from the register selection. Otherwise our algorithm is very close to the one described in [11].

Traub et al. [13] emit spill and reload instructions during register allocation, eliminating a separate pass in which the instruction stream is rewritten. A spilled value can be reloaded into any free register later, so that a value can reside in different registers during its life. While the ability to split long intervals is definitely an advantage, SSA form tends to produce shorter intervals from the beginning. For example, the live interval of the value v in Fig. 17a is [1,9[. In SSA form (Fig. 17b) the interval is split into 4 intervals ([1,2[, [4,7[, [9,10[, [12,12[), each of which can reside in a different register. Therefore the need for interval splitting is not as urgent as without SSA form. Traub's algorithm has to insert register moves at certain block boundaries, because values can be in different locations at the beginning and the end of a control flow edge. In a similar way, we insert moves for the operands of φ-functions (instructions 7 and 10 in Fig. 17b) and eliminate unnecessary moves by coalescing values later.
[Figure: the same code fragment a) without and b) with SSA form. In a) the value v has the single live interval [1,9[; in b) the renamed values v0, v1, v2 and v3 = φ(v1, v2) have the short intervals [1,2[, [4,7[, [9,10[ and [12,12[.]

Fig. 17. Length of live intervals a) without and b) with SSA form
6.3 Measurements
The first version of our compiler used a graph coloring register allocator, which we later replaced by a linear scan allocator. In order to compare their speed we compiled the first 1000 classes of the Java class library. Fig. 18 shows the time used for register allocation (in milliseconds) depending on the size of the compiled methods (in bytecodes). We can see that linear scan has nearly linear time behavior and remains efficient even for larger methods, whereas the time for graph coloring tends to increase disproportionately. For large programs linear scan is several times faster than graph coloring.
Fig. 18. Run time of graph coloring vs. linear scan
7 Summary
We described how to adapt the linear scan register allocation technique to programs in SSA form. Due to SSA form, the live intervals of most values become short, which allows us to keep the same variable in different registers during its lifetime without splitting live intervals. We also showed how to deal with values that have to reside in specific registers, as is common in many CISC architectures.
Acknowledgements. We would like to thank Robert Griesemer, Srdjan Mitrovic and Kenneth Russell from Sun Microsystems for supporting our project as well as the anonymous referees for providing us with valuable comments on an early draft of this paper.
References

1. Aho, A.V., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques, and Tools. Addison-Wesley (1986)
2. Appel, A.W.: Modern Compiler Implementation in Java. Cambridge University Press (1998)
3. Briggs, P., Cooper, K., Torczon, L.: Improvements to Graph Coloring Register Allocation. ACM Transactions on Programming Languages and Systems 16, 3 (1994) 428-455
4. Chaitin, G.J., Auslander, M.A., Chandra, A.K., Cocke, J., Hopkins, M.E., Markstein, P.W.: Register Allocation via Coloring. Computer Languages 6 (1981) 47-57
5. Chow, F.C., Hennessy, J.L.: The Priority-Based Coloring Approach to Register Allocation. ACM Transactions on Programming Languages and Systems 12, 4 (1990) 501-536
6. Cytron, R., Ferrante, J., Rosen, B.K., Wegman, M.N.: Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. ACM Transactions on Programming Languages and Systems 13, 4 (1991) 451-490
7. Griesemer, R., Mitrovic, S.: A Compiler for the Java HotSpot™ Virtual Machine. In Böszörmenyi et al. (eds.): The School of Niklaus Wirth. dpunkt.verlag (2000)
8. Johansson, E., Sagonas, K.: Linear Scan Register Allocation in the HiPE Compiler. International Workshop on Functional and (Constraint) Logic Programming (WFLP 2001), Kiel, Germany, September 13-15, 2001
9. Mössenböck, H.: Adding Static Single Assignment Form and a Graph Coloring Register Allocator to the Java HotSpot Client Compiler. TR-15-2000, University of Linz, Institute of Practical Computer Science, 2000
10. Poletto, M., Engler, D.R., Kaashoek, M.F.: A System for Fast, Flexible, and High-Level Dynamic Code Generation. Proceedings of the ACM SIGPLAN Conf. on Programming Language Design and Implementation, Las Vegas (1997) 109-121
11. Poletto, M., Sarkar, V.: Linear Scan Register Allocation. ACM Transactions on Programming Languages and Systems 21, 6 (1999) 895-913
12. Sedgewick, R.: Algorithms, 2nd edition. Addison-Wesley (1988)
13. Traub, O., Holloway, G., Smith, M.D.: Quality and Speed in Linear-Scan Register Allocation. Proceedings of the ACM SIGPLAN Conf. on Programming Language Design and Implementation (1998) 142-151
Global Variable Promotion: Using Registers to Reduce Cache Power Dissipation

Andrea G. M. Cilio¹ and Henk Corporaal²

¹ Delft University of Technology, Computer Engineering Dept.
Mekelweg 4, 2628CD Delft, The Netherlands
[email protected]
² IMEC, DESICS division, Leuven, Belgium
[email protected]
Abstract. Global variable promotion, i.e. allocating unaliased globals to registers, can significantly reduce the number of memory operations. This results in reduced cache activity and less power consumption. The purpose of this paper is to evaluate global variable promotion in the context of ILP scheduling and estimate its potential as a software technique for reducing cache power consumption. We measured the frequency and distribution of accesses to global variables and found that few registers are sufficient to replace the most frequently referenced variables and capture most of the benefits. In our tests, up to 22% of memory operations are removed. Four registers, for example, are sufficient to reduce the energy-delay product by 7 to 26%. Our results suggest that global variable promotion should be included as a standard optimization technique in power-conscious compilers.
1 Introduction
Certain code optimizations, like register allocation, offer increased potential for code improvement when applied to whole programs. Several research works, some of which resulted in a production compiler [15], have explored the potential of inter-module register allocation and global variable promotion. The latter technique allocates global variables in registers for a part of their lifetime crossing procedure and module boundaries (possibly for the entire lifetime). These works have always considered execution time the primary metric of evaluation. However, as we show in this paper, in the context of instruction scheduling for ILP processors performance is not so sensitive to inter-module register allocation; in this context, earlier results do not apply anymore. With the increasing importance of low-power designs, due to the rapidly growing portable electronics market, we believe that metrics like energy and energy-delay product should be used to evaluate these and other software techniques. From the point of view of execution cycle count, reserving a register for a global variable throughout the program lifetime is advantageous when the target architecture offers enough registers with respect to the number of interfering live
ranges, which may be limited by, e.g., the lack of instruction-level parallelism. In these situations, a number of registers may be left underutilized. Modern multimedia, general-purpose, and DSP processors, like the Trimedia TM1000 [7], Intel's IA-64, and Analog Devices' ADSP-TS001M, offer large register files. Although this large number of registers is necessary to sustain high levels of ILP, the compiler's ILP-enhancing techniques may not always succeed in utilizing them all effectively. By assigning underutilized registers to global scalar variables, the compiler can eliminate all the load and store operations that access those variables, thereby reducing the dynamic operation count and the cache-processor traffic. From the point of view of power consumption, this is advantageous, because a large fraction of the overall power consumption in modern processors is due to cache activity [12]. The purpose of this paper is to evaluate global variable promotion in the context of instruction-level parallel (ILP) scheduling and to estimate its potential as a software technique for reducing cache power consumption. Also, we investigate possible trade-off points between execution time and energy consumption for different cache and CPU configurations with varying degrees of ILP. The rest of this paper is organized as follows. Section 2 analyses the potential of global variable promotion and inter-module register allocation and presents the algorithm used to promote global scalar variables. Using the power dissipation model presented in section 3, section 4 evaluates the effect of global variable promotion on performance and two energy-related metrics. Section 5 reviews related work. Finally, section 6 summarizes the results obtained.
2 Global Register Allocation
A number of code generation systems extend program analyses and optimizations to the inter-module (or whole-program) scope. Among these optimizations, inter-module register allocation and global variable promotion have received some attention [18] [17] [2] [15]. In this section we first evaluate the potential of these two optimization techniques in our compiler. After concluding that only global variable promotion seems promising, we present an algorithm for global variable promotion.

2.1 Potential of Inter-module Register Allocation and Global Variable Promotion
Inter-module register allocation (and its restricted inter-procedural variant) aims at reducing the execution overhead due to save and restore code around function calls. While this can be effective when compiling for languages with frequent function calls, like LISP [17], the potential measured in other works, even with more sophisticated approaches, seems low for languages like C and Pascal; the speedup ranges from 1 to 3% [2] [15]. To verify that the potential of inter-module register allocation is scarce in our C compiler (based on gcc), we performed a number of tests with and without function inlining (see Table 1).
Table 1. Effect of function inlining on a set of benchmarks

            % call operations    % size     % cycles
benchmark   original   inline    increase   reduction
compress     0.368     0.005     18.143      1.557
cjpeg        1.263     0.063      6.682     11.306
djpeg        0.209     0.018      3.089      2.119
mpeg2dec     1.395     0.132     24.472     30.802
average      0.809     0.055     13.097     11.446
Table 2. Potential speedup of inter-module allocation of local variables: upper bounds

                   % reduction
              cycles              mops
benchmark   original  inline   original  inline
compress      2.305    0.010     6.499    0.044
cjpeg         1.124    0.748     8.919    3.906
djpeg         0.334    0.198     2.844    0.721
mpeg2dec     15.731    1.809    36.441    4.571
average       4.873    0.691    13.676    2.311
Details about the target machine can be found in section 4.3, while the benchmarks are presented in section 4.2. Columns 4 and 5 of Table 1 show, respectively, the code size increase and the speedup of the inlined program with respect to the original program. Function inlining drastically reduces the number of function calls at the cost of a modest code size increase. The very low fraction of call operations after inlining (column 3) suggests that save and restore code does not constitute a large overhead. A good upper bound on the speedup that could be achieved by means of inter-module register allocation is obtained by totally disabling the generation of save and restore code around calls. The performance is correctly measured by our cycle-accurate simulator, which takes care of saving and restoring the used registers “on behalf” of the program. Columns 2 and 3 of Table 2 show the speedup obtained when the original and the inlined versions of the programs are compiled without generating save/restore code, while the last two columns show the reduction in memory operations. From these data we can conclude that the potential of inter-module register allocation is negligible after function inlining has been applied. Also, notice that this upper bound is not always achievable: recursive functions, for example, still require some save and restore code. In addition to the low fraction of function calls, another reason contributes to these very low upper bounds: the caller- and callee-saved register conventions [2] are effectively used in our compiler [8] to minimize unnecessary save and restore code for registers that are not live around a function call.
Table 3. Memory operations and accesses to global scalar variables as fractions of all operations and of memory operations executed, respectively

                         % globals
benchmark   mops %   unscheduled   scheduled
compress     31.4        33.0         25.9
cjpeg        24.8        26.5         16.0
djpeg        22.8        18.5          8.6
mpeg2dec     31.3        20.4         12.7
average      25.58       24.6         15.8
Promoting global scalar variables appears to be more promising than inter-module register allocation. Previous works reported speedups ranging from 7% [15] to 10–20% for a set of small benchmarks [18], and found that global variable promotion is of greater benefit than inter-procedural register allocation. These works have also shown that scalar variable accesses represent a substantial fraction of the total number of memory operations that access global (static) data. Our measurements, however, do not fully confirm this fact, as shown in Table 3. Columns 3 and 4 contain the total accesses to global scalar variables as a fraction of the total memory operations. These values have been measured in unscheduled (and only partially optimized) and scheduled code, respectively. The measured difference can be ascribed to function inlining (which is not applied to unscheduled code) and the additional optimizations performed during scheduling. The difference with previously reported results can be partially explained by the fact that, while we only count variables residing in memory, the baseline register allocator used by Wall [18] also considers constants and link-time constant addresses ‘globals’, and stores them in memory. These amount to a substantial portion of the overall memory references. In fact, Wall reports that the most important globals are a few frequently used numeric constants, and that keeping them in global registers captures much of the link-time allocation advantage. Since our compiler encodes all constant values (including link-time constant addresses) in immediate fields, it is not surprising that we find fewer globals.

2.2 Algorithm for Global Variable Promotion
The scarce potential shown by inter-module optimization, discussed in the previous section, led us to focus on variable promotion. The results reported by Santhanam [15] suggest that a simple algorithm for global variable promotion performs almost as well as the most sophisticated ones. For this reason, we chose blanket promotion, a simple algorithm which replaces a set of selected global variables with registers throughout the program. To obtain alias information on global-scope scalar variables, we added a post-linkage analysis pass. This pass determines which variables have their address taken in at least one of the modules and are thus not eligible for promotion. All
unaliased global variables are candidates for assignment to registers. The decision of which global variables to select, given a budget of registers for promoted variables, is based on the number of load and store operations that would be eliminated. The frequencies are obtained by profiling. Variable promotion is applied after all modules and library functions have been linked together, before instruction scheduling [3].
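A minimal sketch of this selection step might look as follows; the Global record, field names, and profile numbers are ours, not taken from the implementation described here:

    /* Sketch of blanket promotion's selection: sort the unaliased
     * globals by profiled access count and promote the top ones
     * within the register budget. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        const char *name;
        int  address_taken;   /* set by the post-linkage alias pass */
        long accesses;        /* profiled loads + stores            */
        int  promoted;
    } Global;

    static int by_accesses_desc(const void *a, const void *b) {
        long x = ((const Global *)a)->accesses;
        long y = ((const Global *)b)->accesses;
        return (y > x) - (y < x);
    }

    static void select_promotions(Global *g, int n, int budget) {
        qsort(g, n, sizeof *g, by_accesses_desc);
        for (int i = 0; i < n && budget > 0; i++)
            if (!g[i].address_taken) {   /* aliased globals stay in memory */
                g[i].promoted = 1;
                budget--;
            }
    }

    int main(void) {
        Global g[] = { {"htab", 0, 90000, 0}, {"state", 1, 40000, 0},
                       {"free_ent", 0, 15000, 0} };
        select_promotions(g, 3, 2);
        for (int i = 0; i < 3; i++)
            printf("%s: %s\n", g[i].name,
                   g[i].promoted ? "promoted" : "in memory");
        return 0;
    }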
3 Cache Power Consumption
The power dissipation due to on-chip caches is a significant portion of the overall power dissipated by a modern microprocessor. For example, the on-chip D-cache of a low-power microprocessor, the StrongARM 110, consumes 16% of its total power [12]. The current trend towards larger on-chip L1 caches emphasizes the importance of reducing their power dissipation for two reasons: first, larger caches require larger capacitances to be driven; second, larger L1 caches have a higher hit rate and therefore reduce the relative power spent in L2 caches or in off-chip memory communication.

3.1 Cache Power Model
To evaluate the reduction of cache power dissipation we used the analytical model for cache timing and power consumption found in CACTI 2.0 [14], which is based on the cache model proposed by Wilton and Jouppi [19]. The source of power dissipation considered in this model is the charging and discharging of capacitive loads caused by signal transitions. The energy dissipated for a voltage transition 0 → V or V → 0 is approximated by:

E = ½ C V²    (1)
where C is the capacitance driven. An analytical model of cache power consumption includes the equivalent capacitance of the relevant cache components. The power consumption is estimated by combining (1) and the transition count at the inputs and outputs of each modeled component. The cache components fully modeled are: address decoder, wordline, bitline, sense amplifiers, data output driver. In addition, the address lines going off-chip and the data lines (both going off-chip and going to the CPU) are taken into account. Our model does not consider the power dissipated by comparators, data steering logic, and cache control logic. This model is quite accurate; Kamble and Ghose [10] have shown that their model, which is very similar to this one, if coupled with exact transition counts, predicts the power dissipation of conventional caches (i.e., caches whose organization does not use power-reducing techniques like sub-banking and block buffering) with an error within 2%. In our estimations we use accurate counts for cache accesses and address bit transitions to and from memory. The average width of a piece of data written to memory is estimated assuming an equal distribution of
bytes, half-words, and (32-bit) words, as in [12]. Also, we estimate that the transition counts of address and data bits are evenly distributed between accesses that hit and miss the cache.
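The bookkeeping implied by equation (1) can be illustrated with a small numeric sketch; the capacitance values, supply voltage, and transition counts below are placeholders rather than CACTI-derived figures:

    /* Each modeled component dissipates E = 1/2 * C * V^2 per signal
     * transition; the cache energy is the sum over components weighted
     * by their transition counts.  All numbers are placeholders. */
    #include <stdio.h>

    static double transition_energy(double capacitance, long transitions) {
        const double vdd = 2.5;             /* supply voltage (V), assumed */
        return 0.5 * capacitance * vdd * vdd * (double)transitions;
    }

    int main(void) {
        double e = 0.0;
        e += transition_energy(2.0e-12, 1000000);    /* bitlines     */
        e += transition_energy(0.5e-12, 1000000);    /* wordlines    */
        e += transition_energy(5.0e-12,  200000);    /* output lines */
        printf("estimated cache energy: %g J\n", e);
        return 0;
    }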
3.2 Energy-Related Metrics
To evaluate the efficiency of global variable promotion we measure the energy-delay (E-D) product. This metric was proposed by Gonzalez and Horowitz [5], who argue that it is superior to the commonly used power or energy metrics because it combines energy dissipation and performance. To compute the delay D we assumed a clock frequency compatible with the access times estimated by CACTI. The E-D product is given by:

ED = E · D = P · D² = P · (N_cycles · T_clock)²    (2)
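A worked use of equation (2), with an assumed power and clock period (only the cycle count is taken from Table 4 below):

    /* For a run of N_cycles at clock period T_clock and average power P,
     * the delay is D = N_cycles * T_clock and E-D = P * D^2. */
    #include <stdio.h>

    int main(void) {
        double P       = 0.5;          /* average power in W (placeholder)   */
        double Tclock  = 5e-9;         /* 200 MHz clock period (placeholder) */
        double Ncycles = 30.3e6;       /* mpeg2decode's cycle count, Table 4 */
        double D  = Ncycles * Tclock;  /* delay in seconds                   */
        double ED = P * D * D;         /* = E * D, since E = P * D           */
        printf("D = %g s, E-D = %g J*s\n", D, ED);
        return 0;
    }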
Although the E-D metric presents important advantages, like reduced dependence on technology, clock speed, and implementation, energy consumption remains an important metric for battery-operated processors in portable devices, because it determines their battery duration [1]. In our experiments, the energy reduction closely follows the reduction of energy-delay. Nevertheless, we also report the energy results.
4 Experimental Results
We present the results of our simulations in this section. First, we briefly introduce our code generation infrastructure, our benchmarks, and the target machines used for the simulations. The results are presented in three parts: the frequency distribution of global variables, the performance results, and the energy efficiency of the data cache.

4.1 Code Generation Infrastructure
Figure 1 shows our code generation path. It generates code for a templated architecture especially suited for Application-Specific Instruction-Set Processors, called Move. This architecture offers explicitly programmed instruction-level parallelism, in a fashion similar to that of VLIW architectures [4]. For the purpose of this paper, the details of the architecture used for the evaluation are unimportant. Its inherently low-power characteristics, however, make the contribution of caches to the overall chip power consumption even larger than in a conventional architecture. The code generation is coarsely split into two phases: (1) compilation to a generic-machine instruction set and (2) target-specific instruction scheduling, which also integrates register allocation [9]. Simulation of the generic, unscheduled code is used to generate profiling data. The intermediate representation used in the first phase of code generation is SUIF, the Stanford University Intermediate Format [6].
Fig. 1. The adapted code generation trajectory

Table 4. Benchmarks used for evaluation

benchmark     instr.   cycles   description
compress       4855     2.0M    Unix utility for file compression.
djpeg         16421    19.7M    JPEG image decompression.
cjpeg         16526    29.8M    JPEG image compression.
mpeg2decode   12935    30.3M    Standard MPEG-2 format decoder.
Instead of generating the traditional textual assembly output, the compiler generates and maintains a structured representation of the machine code in MachSUIF [16], a format derived from SUIF. MachSUIF maintains all source-level information, as well as any other piece of information gathered during analysis passes. This format allows us to perform sophisticated code analysis on whole programs and makes the related code transformations much easier to apply than on a binary format [3].

4.2 Benchmark Characteristics
Four benchmarks have been used for our experimental evaluations. Their static code size and dynamic operation count (test set) are summarized in table 4. All benchmarks have been profiled with a training input data set and tested with a different data set. We selected multi-module programs of a sufficient level of complexity, such that the use of global (scalar) variables is almost unavoidable. Small benchmarks, on the other hand, are often coded without using global scalar variables. Compress, with its relatively small size, is an exception, in that it is a single-source, simple program with frequent accesses to global scalar variables.
Table 5. Machine configurations used for the evaluation

resource            quantity
                    M1       M2
transport busses    3        8
long immediates     1        2
# integer regs.     varies   varies
# FP regs.          16       48
# boolean regs.     2        4
cache size          16KB     32KB

unit       latency   quantity
                     M1   M2
LSU        2         1    2
IALU       1         2    4
multiply   3         1    1
divide     8         1    1
FPU        3         1    1

4.3 Target Machines
We performed our evaluation on two Move target machines with different costs and capabilities. The two machine configurations were selected in order to evaluate how ILP affects the results of global variable promotion. Our Move architecture is a kind of VLIW machine with a streamlined, reduced instruction set. The smaller machine, M1, is slightly more powerful than a simple single-issue RISC processor.¹ The average IPC measured for our benchmarks ranges between 1.2 and 1.3. We selected this configuration in order to estimate the effect of global promotion on a single-issue machine. The larger machine, M2, is capable of performing about 4 operations per cycle, two of which can be data memory accesses. In this case, the average IPC measured for our benchmarks is 1.7–2.3. Table 5 summarizes the characteristics of the machine configurations. The busses are explicitly programmed to transport data between execution units and register files. The boolean registers allow operations to be guarded and their execution predicated. We assumed that the CPU is attached to a 2-way set-associative, write-through, on-chip data cache with an LRU replacement policy. The cache line size is 32 bytes. Although the results shown in the following sections were obtained with 16KB and 32KB caches, other cache sizes have been tried. For all configurations the relative energy reduction is very similar.

4.4 Distribution of Global Variable Uses
The number of accesses to the memory segment dedicated to global data varies widely from benchmark to benchmark [13]. The relative frequency of memory operations that access global scalar variables poses a clear upper bound on the improvement of energy efficiency achievable via global variable promotion. Fortunately, the accesses to global scalar variables have a desirable characteristic. As shown in figure 2(a), only a few variables are sufficient to cover most of the memory operations due to accesses to global scalar variables.
¹ Due to limitations in the current implementation, the integrated instruction scheduler/register allocator cannot generate code for a machine configuration with only one integer ALU.
Fig. 2. Dynamic memory operation count covered by global scalar variables (a) on scheduled and optimized code, (b) on unscheduled code
The values on the Y axis are the number of memory operations (as a fraction of the total memory operation count) due to the N most used global scalar variables, where N is reported on the X axis. This indicates that it is sufficient to dedicate only a few registers to global variables to capture most of the benefit of global variable promotion. The results shown in figure 2(a) refer to scheduled and highly optimized code on M1, for which most intra-procedural unaliased accesses to global variables have been optimized away. Code optimizations considerably reduce the relative frequency of accesses to global variables, as confirmed by figure 2(b), which depicts the same frequency distribution obtained from unscheduled code. Part of this reduction is accounted for by function inlining, which opens new opportunities for intra-procedural optimizations.
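The curves of Fig. 2 follow directly from the sorted per-variable access counts; a sketch of the computation, with made-up counts:

    /* The Y value plotted at N is the cumulative access count of the
     * top N globals divided by the total memory operation count. */
    #include <stdio.h>

    int main(void) {
        long counts[]      = { 90000, 40000, 15000, 5000, 2000 };
        int  nvars         = 5;
        long total_mem_ops = 600000;   /* made up, like the counts above */
        long cumulative    = 0;
        for (int n = 1; n <= nvars; n++) {
            cumulative += counts[n - 1];
            printf("top %d globals cover %.3f of memory operations\n",
                   n, (double)cumulative / (double)total_mem_ops);
        }
        return 0;
    }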
4.5 Performance
We compiled 9 different versions of each benchmark, with a budget dedicated to global variables ranging from 0 to 8 registers. For a given register budget n, the n most frequent global variables were promoted, resulting in the same number of registers not being available for general register allocation. The first series of tests measures the effect of global variable promotion on performance. Figure 3 shows the cycle count of the four benchmarks for different sizes of the integer register file. The modest speedup can be explained by the fact that load operations associated with global variables have a constant address and do not have flow dependencies with preceding operations, and can therefore be scheduled with considerable freedom. Thanks to this freedom, our instruction scheduler is capable of hiding the latency of most load operations associated with global variables. The results in figure 3 confirm that the effect of scheduling freedom prevails, thus making the performance improvement modest to negligible depending on the benchmark. Figure 4 shows the dynamic count of load and store operations for the same series of tests. The reduction in memory operations executed is in good accordance with the usage distributions in figure 2, except when the most
Fig. 3. Performance results: dynamic cycle counts on ‘M1’
register-hungry benchmarks are run on machine configurations with a small integer register file. In such cases, the reduced number of registers available for general register allocation quickly offsets the gains of variable promotion. This is due to the introduction of false dependencies, which pose a tighter constraint on scheduling freedom. A further increase in register pressure results in a large number of spill operations. We also measured the miss rate of the data cache and found that it increases as more global variables are promoted. Obviously, this is only a relative increase, due to the fact that the number of memory accesses decreases more than the number of cache misses. This result confirms that global variables show high temporal locality [13]. We can therefore conclude that global promotion reduces cache activity but does not significantly affect the CPU-memory traffic.

4.6 Energy and Energy-Delay Product
A reduction of energy and energy-delay product, consistent with the reduction of memory operations, has been measured for configurations of M1 and M2 with varying numbers of registers. Figure 5 shows the results for M1 with 64 registers, relative to the original program without variable promotion. Very similar reductions are found for the M2 configuration with 64 registers, as can be seen from figure 6. In this case a 32KB cache was measured.
Fig. 4. Performance results: dynamic memory operation counts on ‘M1’
While the level of ILP seems not to have a significant impact on the effect of global variable promotion, the number of available registers is critical. Figure 7 shows the energy-delay product on M1 and M2 when only 32 registers are available. Only compress shows consistent improvement, owing to its low register pressure; in all other benchmarks, the register pressure results in more spill code and cache activity when too many globals are promoted. The reduction in energy consumption is paired with reduced execution times, as can be seen by comparing figure 3 with figures 5, 6, and 7; therefore we cannot speak of a clear trade-off between performance and energy consumption for this software technique. This is easy to explain, since the primary source of performance degradation caused by global variable promotion is register pressure, which often results in register spilling and therefore additional memory operations and increased cache activity.
5 Related Work
In this section we review previous work on architectural/software techniques for reducing data cache power consumption. Work on whole-program register allocation has been briefly discussed in section 2.
Fig. 5. Relative energy consumption (right) and energy-delay product (left) for a configuration of ‘M1’ with 64 integer registers
Fig. 6. Relative energy consumption (right) and energy-delay product (left) for a configuration of ‘M2’ with 64 integer registers

It has recently been demonstrated that memory traffic due to references to the global section of a program (which includes scalar global variables) shows very high temporal locality, with an average life span of cache lines up to almost one order of magnitude higher than that of accesses to the heap region [13]. For this reason, most traffic due to accesses to global variables can be captured by a small dedicated cache. Since stack accesses show even better cacheability, the authors subdivide the data cache into three region caches which cover global data, stack, and heap. This three-component on-chip cache system is much more power-efficient than the conventional single data cache: 37% to 56% less power is dissipated, depending on the cache configuration. Another recent work on architectural-level low-power cache design is presented by Kin and others [12], who propose to insert an unusually small cache before what is normally the L1 on-chip cache. This small cache, called the filter cache, reduces the access cost by roughly a factor of 6 at the cost of an increased cache miss rate and increased miss latency. This makes it possible to trade off power efficiency against performance. The authors show that a clear optimal point exists between no filter cache at all and a filter cache of the same size as the conventional L1 cache. With an optimal filter cache size (512 bytes) the energy-delay product is reduced by 50% at the expense of a 21% increase in cycle count.
Fig. 7. Relative energy-delay product for a configuration of ‘M1’ (right) and ‘M2’ (left) with 32 integer registers

The use of global variable promotion to reduce power consumption proposed in this paper exploits the same principles used in the Filter Cache [12] and the Region-Based Cache [13]. While the former exploits the locality principle to decrease power consumption by introducing a new level in the memory hierarchy, our approach achieves a similar result by using the register file. The registers allocated to frequently accessed global scalar variables can also be compared to the Region-Based cache partition dedicated to global data references. In this case, the use is further limited to a selected subset of scalar global variables. Many other architectural techniques for improving the energy efficiency of caches have been proposed. Kamble and Ghose, for example, evaluate the effectiveness of two such techniques: block buffering and sub-banking. The interested reader is referred to their paper [11] and to the section on previous work in [13] for further references about this important research area.
6 Conclusions
Power and energy consumption have become critical issues in high-performance and portable/embedded processors, respectively. As a consequence, new microarchitectural and code generation techniques for power reduction are researched with increasing interest. At the same time, traditional software techniques, like loop unrolling, take on a new light when energy-related metrics are considered [1]. Global variable promotion is in our opinion one of those software techniques that deserve new attention in the context of power reduction. In this paper we evaluated the effect of global variable promotion on performance and cache energy consumption, and found that significant savings, up to 26%, are achieved by promoting a few (4–8) critical global variables. In summary, the results suggest that on ILP architectures the effect of global variable promotion on performance is rather limited. However, this technique can significantly reduce data cache power consumption, and it should be included as a standard optimization technique in power-conscious compilers.
References

1. David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 83–94, Vancouver, British Columbia, June 12–14, 2000.
2. Fred C. Chow. Minimizing register usage penalty at procedure calls. In SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 85–94, 1988.
3. Andrea G. M. Cilio and Henk Corporaal. A linker for effective whole-program optimizations. In Proceedings of HPCN, Amsterdam, The Netherlands, April 1999.
4. Henk Corporaal. Microprocessor Architectures; from VLIW to TTA. John Wiley, 1997. ISBN 0-471-97157-X.
5. R. Gonzalez and M. Horowitz. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits, 31(9):1258–66, September 1996.
6. Stanford Compiler Group. The SUIF Library. Stanford University, 1994.
7. Jan Hoogerbrugge. Instruction scheduling for TriMedia. Journal of Instruction-Level Parallelism, 1(1–2), 1999.
8. J. Janssen. Compilation Strategies for Transport Triggered Architectures. PhD thesis, Delft University of Technology, 2001.
9. Johan Janssen and Henk Corporaal. Registers on demand: an integrated region scheduler and register allocator. In Conference on Compiler Construction, April 1998.
10. M. B. Kamble and K. Ghose. Analytical energy dissipation models for low-power caches. In Proceedings of the International Symposium on Low Power Electronics and Design, Monterey, CA, USA, August 1997. ACM.
11. M. B. Kamble and K. Ghose. Energy-efficiency of VLSI caches: a comparative study. In Proceedings of the Tenth International Conference on VLSI Design, pages 261–7. IEEE, January 1997.
12. Johnson Kin, Munish Gupta, and William H. Mangione-Smith. Filtering memory references to increase energy efficiency. IEEE Transactions on Computers, 49(1), January 2000.
13. Hsien-Hsin S. Lee and Gary S. Tyson. Region-based caching: An efficient memory architecture for embedded processors. In CASES, San Jose, CA, November 2000.
14. G. Reinman and N. P. Jouppi. An integrated cache timing and power model. Technical report, COMPAQ Western Research Lab, Palo Alto, California, 1999.
15. Vatsa Santhanam and Daryl Odnert. Register allocation across procedure and module boundaries. In Proceedings of the Conference on Programming Language Design and Implementation, pages 28–39, 1990.
16. Michael D. Smith. Extending SUIF for Machine-dependent Optimizations. In Proceedings of the First SUIF Workshop, January 1996.
17. Peter A. Steenkiste and John L. Hennessy. A simple interprocedural register allocation algorithm and its effectiveness for LISP. TOPLAS, 11(1), 1989.
18. David W. Wall. Register windows vs. register allocation. Technical Report 7, Western Research Laboratory, Digital Equipment Corporation, December 1987.
19. S. J. E. Wilton and N. P. Jouppi. An enhanced access and cycle time model. Technical Report 5, Digital Western Research Laboratory, Palo Alto, California, July 1994.
Optimizing Static Power Dissipation by Functional Units in Superscalar Processors

Siddharth Rele¹, Santosh Pande², Soner Onder³, and Rajiv Gupta⁴

¹ Dept. of ECECS, University of Cincinnati, Cincinnati, OH 45219
² College of Computing, Georgia Tech, Atlanta, GA 30318
³ Dept. of Computer Science, Michigan Tech. Univ., Houghton, MI 49931
⁴ Dept. of Computer Science, The Univ. of Arizona, Tucson, Arizona 85721
Abstract. We present a novel approach which combines compiler, instruction set, and microarchitecture support to turn off functional units that are idle for long periods of time, reducing their static power dissipation by power gating [2,9]. The compiler identifies program regions in which functional units are expected to be idle and communicates this information to the hardware by issuing directives for turning units off at entry points of idle regions and directives for turning them back on at exits from such regions. The microarchitecture is designed to treat the compiler directives as hints, ignoring a pair of off and on directives if they are too close together. The results of experiments show that some of the functional units can be kept off for over 90% of the time at the cost of a minimal performance degradation of under 1%.
1 Introduction
To cater to the demands for high performance by a variety of applications, faster and more powerful processors are being produced. With increased performance there is also an increase in the power dissipated by the processors. High performance superscalar processors achieve their performance by exploiting instruction level parallelism (ILP). ILP is detected dynamically and instructions are executed in parallel on multiple functional units. Therefore one source of power dissipation is the functional units. It is well known that ILP is often distributed non-uniformly throughout a program. As a result, many of the functional units are idle for prolonged periods of time during program execution, and the power they dissipate during these periods is wasted. The goal of this work is to minimize the power dissipated by functional units by exploiting long periods of time over which some functional units are idle. The power consumed by functional units falls into two categories: dynamic and static. With current technology the dynamic power is the dominating component of overall power consumption, and by using clock gating techniques the dynamic power dissipated by functional units during idle periods can be reduced [4,12,13]. However, it is projected that in a few generations the static power dissipation will
Supported by DARPA award no. F29601-00-1-0183.
equal dynamic power dissipation [11]. Specifically, for different kinds of adders and multipliers, the increase in static power with changing technology is shown in Table 1 [5]. Therefore it is important to also minimize the static power consumption when the functional units are idle.

Table 1. Static power dissipation by functional units

                       Technology (µm)
Functional unit type   0.35   0.18   0.13   0.10   0.07
Adders                        Static power dissipation (mW)
  Ripple Carry         0.07   0.08   0.12   0.14   0.15
  Carry Lookahead      0.09   0.11   0.19   0.20   0.19
  Manchester Carry     0.10   0.16   0.23   0.25   0.28
Multipliers                   Static power dissipation (mW)
  Serial               0.29   0.32   0.43   0.51   0.50
  Serial/Parallel      0.35   0.41   0.48   0.55   0.58
  Parallel             0.37   0.46   0.50   0.60   0.62
There are two known techniques that are suitable for reducing static power dissipation by functional units during long periods of idleness. The first technique is power gating [10,9], which turns off devices by cutting off their supply voltage. The second technique uses dual threshold voltage technology: by raising the threshold voltage during idle periods of time, the static power dissipation is reduced [14]. In both of the above approaches there is a turn-on latency involved, that is, when the unit is turned back on (either by providing the supply voltage or lowering the threshold voltage) it cannot be used immediately, because some time is needed before the circuitry returns to its normal operating condition. While the latency for power gating is typically a few (5–10) cycles [2], the latency for dual threshold voltage technology is much higher. In this work we assume that power gating is being employed to turn off functional units, and we assume a latency of ten cycles for turning a functional unit on in all our experiments. The shutting down of functional units is most effectively accomplished by employing a combination of compiler and hardware techniques. To understand the reasons for this claim, let us examine the problems that we must address in designing an algorithm for turning functional units off and then back on, and then evaluate the suitability of the means for solving each problem, that is, whether to use compiler support or hardware support in addressing it. We also describe the approach that we take in addressing each of the problems.

Identifying idle regions. In order to turn off a functional unit we first must identify regions of code in the program over which the functional unit is expected to be idle. The use of hardware for predicting or detecting idle regions has the following problems. First, the additional hardware for predicting idle regions will itself consume additional power throughout the execution, as it must remain active
all along. Second, we will not be able to exploit idle regions during the warm-up period of the prediction mechanism: only after enough history has been acquired by the prediction hardware will the predictions be effective. Our solution to the above problems is to rely on the compiler to identify program regions with low ILP and thus low functional unit demands. The compiler can examine all of the code off-line and therefore identify suitable regions for turning the functional units off. Furthermore, it can also identify the type of functional units and determine the number of functional units that should be turned off without degrading performance. This information is then communicated to the hardware by generating special off and on directives.

Tolerating the latency of turning a functional unit off. The functional unit must be turned off sufficiently prior to entering the program region in which it can be kept idle. This is because there is a latency for turning the unit off, and we must account for this latency to maximize the power savings. The latency arises because time is needed to drain the functional unit by allowing it to execute the instructions already assigned to it. Let us assume that we have two functional units of a given type and we would like to turn one of them off. When the off directive is encountered, the functional units may already have instructions assigned to them. One of the units must be selected and drained before it is turned off. This problem is also not suitable for handling in hardware, because even if we were to overcome the problems described earlier and develop a mechanism for efficiently detecting idle regions in hardware, we would now have to predict them even earlier. Therefore our solution is to allow the compiler to place the off directive sufficiently in advance of reaching the idle region whenever possible.

Tolerating the latency of turning a functional unit on. The functional unit must also be turned on prior to exiting the idle region. This is because there is a several-cycle latency before the functional unit comes on-line and is ready to execute operations [2]. By tolerating this latency we can minimize the performance degradation while executing instructions from the region following the idle region. Again our solution to this problem is to place the on directive sufficiently in advance of exiting the idle region whenever possible.

Dealing with variable length idle regions. Sometimes the duration of an idle region may vary from being very small in one execution of the region to very long in the next execution of the same region. For example, the idle region may contain a while loop or conditionals which may lead to this variation. Introduction of an off directive in such a situation can be based upon a conservative policy or an aggressive policy. A compiler based upon a conservative policy will introduce the off and on directives only if it is certain that the duration of the idle region is long. The problem with this approach is that the reductions in power dissipation that could be obtained by turning a unit off are sacrificed. We propose to use an aggressive policy in which the compiler introduces the off and on directives to maximize savings. If the duration of the idle region is
long, power savings result. On the other hand, if the duration is very small, the on directive is issued on the heels of the off directive. If the latter situation arises frequently, little or no power is saved, while some amount of dynamic power is dissipated in switching the functional unit state. Moreover, performance is hurt, as the functional unit goes off-line for several cycles each time such a spurious pair of off and on directives is encountered. We address this issue by providing adequate microarchitecture support for nullifying spurious off and on pairs. The microarchitecture is designed to treat the compiler directives as hints, ignoring a pair of off and on directives if they are too close together. In this way the state of the unit is not actually switched, the unit stays on-line, and both the dynamic power for switching the unit off and on and the degradation in performance are minimized. We have incorporated the power-aware instructions into the MIPS-I instruction set and simulated a superscalar architecture which implements these instructions using our FAST simulation system [8]. The compiler algorithms have been incorporated into the lcc compiler. The results of experiments show that some of the functional units can be kept off for over 90% of the time, resulting in a corresponding reduction in static power dissipation by these units. Moreover, the power reductions are achieved at the cost of very minimal performance degradation: well under 1% in all cases. The remainder of the paper is organized as follows. In section 2 we discuss the instruction set extensions and microarchitecture modifications required to implement the new instructions. In section 3 we discuss in detail the compiler algorithms for introducing on and off instructions. In section 4 we describe our implementation and in section 5 we present results of experiments. Conclusions are given in section 6.
2 Architectural Support
Power-aware instruction set. As mentioned earlier, we support instructions that allow us to turn functional units on or off. Such instructions must also indicate the type of functional unit that is to be turned on or off. The solution we developed adds an on or an off directive as a suffix to existing instructions. The type of functional unit that is to be turned on or off is the same type as that used to execute the instruction to which the directive is added. In case multiple functional units of a particular type are present, the decision as to which specific unit will be turned off is left up to the hardware. In some architectures certain operations can be executed by functional units of more than one type (e.g., integer and floating point). However, we assume that in such cases the off and on directives are attached to instructions that must execute on a functional unit of a specific kind. We have incorporated the on and off directives into the MIPS-I Instruction Set Architecture (ISA), which supports MIPS 32-bit processor cores. This ISA was selected for its simplicity and the availability of encoding space to allow us to encode on and off into existing instructions. A subset of the instructions we modified is shown in Fig. 1.
add.on      switch ON one integer adder
add.off     switch OFF one integer adder
mul.on      switch ON one integer multiplier unit
mul.off     switch OFF one integer multiplier unit
add.s.on    switch ON one float adder
add.s.off   switch OFF one float adder
mul.s.on    switch ON one float multiplier unit
mul.s.off   switch OFF one float multiplier unit
mov.s.on    move values between float regs and switch ON float unit
mov.s.off   move values between float regs and switch OFF float unit

Fig. 1. A subset of energy-aware instructions

These instructions can also be issued without any operands, in which case they do not perform any operation except for switching a unit of the appropriate type on or off. These are needed when on or off directives cannot be added to an existing instruction because the code does not already contain an instruction of the appropriate type around the point at which the compiler chooses to place the directive.

On and off semantics for an out-of-order superscalar processor. The on directive is acted upon immediately following its detection, that is, when the instruction with the on suffix has been decoded, a functional unit of the appropriate type is turned on. It takes a few cycles for the circuitry to reach its normal operational state, after which the unit can perform useful work. The turning off of a functional unit cannot be done immediately following decode. This is because if the unit that is turned off were the last on unit of its type, then no functional unit would be available for executing the instruction carrying the suffix and the processor would deadlock. Therefore in this case, following decode, an on unit is selected and marked as pending-off. When the instruction that marks the unit retires, the unit is actually turned off and its status is changed from pending-off to off. This approach works because it guarantees that all instructions requiring the unit will have executed before the unit is turned off, as all instructions are retired in order even though they may execute on the functional unit out of order in the superscalar processor. At the same time, the introduction of an off directive does not constrain the out-of-order execution capability of the processor. The states of the functional units are maintained as part of the processor state. A status table is maintained that indicates for each functional unit whether it is currently turned on, currently turned off, or in the pending-off state. No new instructions are assigned to a functional unit by the issue mechanism if the unit is in the off or pending-off state.

Nullifying spurious off-on pairs. While savings in static energy consumption result when a functional unit is shut down, a certain amount of performance
loss may be incurred when a unit is turned off, and a certain amount of dynamic power is expended in bringing the circuit back to its normal operating state. We rely upon the compiler to identify suitable idle regions during which turning off a functional unit is not expected to hurt performance and the dynamic power expended in turning the unit on is far smaller than the static power saved by turning it off. For this strategy to work well, it is important that the idle regions be long in duration. However, it is possible that the code representing the idle region varies greatly in duration from one execution to another. For example, the idle region may be formed by a while loop. If very little time is needed to execute the idle region, then the unit will be turned off and then immediately turned on. In this situation the savings in static power will be minimal. However, loss of performance will still be incurred while executing the code immediately following the idle region, and dynamic power will still be expended in turning the unit on. Our implementation of on and off is designed so that we are able to dynamically nullify spurious off and on pairs and thus avoid the dynamic power that would otherwise be dissipated during the transitions. When an instruction with an off directive is encountered, a unit is selected and marked as pending-off. If an instruction with the on directive is encountered while the status of the unit is still pending-off, the unit state is changed from pending-off back to on. When the instruction associated with the off directive retires, it examines the status of the functional unit that it marked as pending-off. If the status is still pending-off, the unit is turned off; otherwise it is left on. Thus, the overall impact of the above approach is that if the on directive is encountered while the functional unit is in the pending-off state, the functional unit is not actually turned off. Thus the off-on pair does not turn the unit off and then back on.

 1 : ....
 2 : mul.off              – turn unit off
 3 : if (x > 0) {
 4 :     wait = 0;
 5 :     while (1) {
 6 :         wait++;
 7 :         if (wait == 1000) break;
 8 :     }
 9 : mul.on               – turn unit on
10 : for (i = 0; i < 100; i++)
11 :     sum += a[i] * 10;
12 : ....
Fig. 2. Nullification of OFF and ON pair

For the example in Fig. 2, the code on lines 3 to 8 takes a very short time to execute when x ≤ 0; otherwise it takes a long time to execute. During the execution of this code we would like to turn the multiplier off since it is not required. If x > 0 we get power savings by turning the unit off. However, if
x ≤ 0, the off and on directives are encountered in rapid succession, and the unit should not be turned off and then immediately turned back on. Before the instruction with the off directive retires, we will already have decoded the instruction with the on directive and changed the status of the unit from pending-off to on. Therefore, when the instruction with the off directive retires, it will find the functional unit status as on and will not turn it off. As a result the spurious off-on pair is nullified.
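The decode- and retire-time handling described above amounts to a small state machine per functional unit; the following sketch uses our own names and a fixed-size status table, not the paper's actual implementation:

    /* Decode of .off marks a unit pending-off; decode of .on while still
     * pending-off nullifies the pair; retirement of the .off instruction
     * turns the unit off only if no .on intervened. */
    #include <stdio.h>

    typedef enum { UNIT_ON, UNIT_PENDING_OFF, UNIT_OFF } UnitState;

    static UnitState status[4] = { UNIT_ON, UNIT_ON, UNIT_ON, UNIT_ON };

    static void decode_off(int u) {
        if (status[u] == UNIT_ON)
            status[u] = UNIT_PENDING_OFF;   /* drain, then turn off at retire */
    }

    static void decode_on(int u) {
        if (status[u] == UNIT_PENDING_OFF)
            status[u] = UNIT_ON;            /* spurious pair: nullified       */
        else if (status[u] == UNIT_OFF)
            status[u] = UNIT_ON;            /* real wake-up, pays the latency */
    }

    static void retire_off(int u) {
        if (status[u] == UNIT_PENDING_OFF)  /* no .on arrived in the meantime */
            status[u] = UNIT_OFF;
    }

    int main(void) {
        decode_off(0); decode_on(0); retire_off(0);   /* Fig. 2, x <= 0 case */
        printf("unit 0 is %s\n", status[0] == UNIT_ON ? "on" : "off");
        return 0;
    }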
3 Compiler Support
Our approach. Our compiler is designed to introduce off- and on-suffixed instructions in such a way that the following two goals are met. First, we want to remove idleness by turning functional units off without causing an increase in program execution time (i.e., we want to reduce static power dissipation without causing performance degradation). Second, the functional units that are turned off should stay off for prolonged periods of time so that the dynamic power dissipated during on-off and off-on transitions is small in comparison to the static power saved by keeping the units off. Both of the above goals are met by careful placement of on- and off-suffixed instructions. In order to achieve the first goal of minimizing performance degradation, we take the following approach. We classify the basic blocks in a program into two categories: hot blocks, whose execution frequencies are greater than a certain threshold value, and cold blocks, which are all the remaining blocks in the program. We also analyze the functional unit usage in each block to identify its requirements and consequently identify the units that are expected to be idle in that block. We place the off and on directives in cold blocks bordering the hot blocks in which the unit is expected to be idle. This situation is illustrated by the example in Fig. 3a. In contrast, the example in Fig. 3b illustrates a situation in which we forego the removal of idleness, since the block neighboring the hot block in which the unit is idle is another hot block where the unit is not idle. The potential placement points for off and on directives are themselves hot, and such instructions would therefore be executed with high frequency. Thus, our approach removes idleness only if such removal does not adversely affect performance. In order to achieve the second goal of maximizing power savings, we do not place instructions carrying off and on directives at the boundaries of a region formed by a single basic block. Instead we identify larger subgraphs in the control flow graph that represent control constructs (e.g., loops), which we refer to as power blocks. Then we classify the power blocks as hot or cold. In addition, from the requirements of the individual blocks in a power block, we identify which functional units are idle throughout the execution of the power block. When power-aware code is generated, the off and on directives are placed at the boundaries of power blocks using the principles described earlier and illustrated in Fig. 3.
[Figure panels: (a) Removing idleness without performance degradation: OFF and ON directives are placed in cold blocks around a hot block in which the unit is idle. (b) Allowing idleness to avoid performance degradation: the idle hot block borders another hot block in which the unit is not idle, so no directives are inserted.]
Fig. 3. Idleness removal strategy

We have given an overview of our approach; we now describe the three main steps of our algorithm in more detail. The first step constructs a power-aware flow graph. The second step identifies the power blocks. The third and final step introduces the off and on suffixed instructions.

The power-aware flow graph (PAFG). Our compiler begins by building the PAFG, which is a control flow graph whose basic blocks are annotated with two types of information: resource requirements and execution counts. The requirements of each block are calculated by first identifying the number of operations requiring each functional unit type in the block. This information by itself is enough for those functional unit types of which only one unit is present: if an operation requiring a functional unit of that type is present, the unit of that type is required. However, this method is inadequate if there are multiple functional units of a given type. We must then assess the instruction level parallelism present in the operations that use the functional unit type. The dependences among statements are examined to identify the parallelism, and the requirements are computed accordingly. In particular, if two instructions that can execute in parallel require the same type of functional unit, then two such units are required. In other words, the requirements of a basic block represent the number and type of units required to exploit the ILP present in the block. Another issue that must be considered during the computation of requirements is that many instructions other than the integer add instruction may use the integer adder. For example, the base + offset computation that produces the address of an array element requires an integer adder.

The profile information that annotates the basic blocks is derived from prior executions of the program. This information is used for identifying hot blocks: if the execution count of a particular block exceeds a threshold, the block is considered hot. The threshold value is set according to the formula given below, in which N is a tunable parameter that can be changed to generate a higher or lower number of hot blocks and thus control how aggressively idleness is removed.
    Threshold = (Execution Count of Most Frequently Executed Block) / N
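A direct transcription of this classification into C might look as follows; the profile array and the function interface are hypothetical:

#include <stdbool.h>

/* Classify basic blocks as hot or cold from profile counts.
   N is the tunable aggressiveness parameter from the formula above. */
void classify_blocks(const long count[], bool is_hot[], int nblocks, long N) {
    long max_count = 0;
    for (int b = 0; b < nblocks; b++)
        if (count[b] > max_count)
            max_count = count[b];
    long threshold = max_count / N;            /* N > 0 assumed */
    for (int b = 0; b < nblocks; b++)
        is_hot[b] = (count[b] > threshold);    /* hot iff above threshold */
}

With N = 10 this reproduces the MaxValue/10 threshold used in the example that follows.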
An example code segment and its power-aware flow graph are shown in Fig. 4a and 4b. The requirements are annotated as a vector of values enclosed in angle brackets (the first value corresponds to integer adders, the second to integer multipliers), while the profiling information is annotated as the execution count enclosed within square brackets. We set the threshold value to MaxValue/10 for identifying hot blocks.

Identifying power blocks. In order to identify longer periods of time over which a functional unit can be turned off, we identify subgraphs representing larger constructs such as loops, if-statements, and switch statements. These subgraphs are referred to as power blocks. A hierarchical graph at the power block level is created in which each power block records the start and end nodes of the subgraph forming it. In addition, a power block holds a summary of all the information regarding the basic blocks that form it. The requirements of a power block are computed from the requirements of the hot blocks inside it; the reason for this will become clear when we discuss how off and on directives are generated. There is only one entry point into a power block; that is, the start node of the power block dominates all the blocks inside it, and hence control has to flow through that node. Therefore, if the start node is hot, the whole power block is marked as hot even though not all of the basic blocks belonging to it may be hot. The hierarchical tree constructed from power blocks for our example is shown in Fig. 4c. Each leaf in this tree is a basic block; internal nodes corresponding to higher level control constructs are the power blocks.

Inserting power-aware instructions. Once all the information regarding the requirements of each basic block as well as each power block is recorded in the respective blocks, we traverse the PAFG for code generation. Our basic approach for introducing the off and on instructions is as follows:
– For each user function we start by turning off all units except a minimal configuration of units. The minimal configuration is required so that execution can proceed and the processor does not deadlock. Typically this configuration will include an integer adder.
– For each call to a library function we assume that all units are on during the execution of the library function. This is because we do not analyze the code of library functions, and therefore, to guarantee that no performance degradation occurs, we must keep all units on. Instructions to turn on the units that are off are therefore introduced immediately prior to the call, and upon return these units can be turned off again. The impact of this restriction can be reduced by performing our optimizations at link time.
void main() {
    for (i = 0; i < 100; i++)
        if (sum < 1000)
            sum = sum + arr[i];
        else {
            sum = sum / 1000;
            count++;
        }
    print(count, sum);
}

(a) Sample code segment.
[Figure panels:
(b) Power-aware flow graph: each basic block is annotated with a requirement vector <integer adders, integer multipliers> and an execution count in square brackets.
(c) Hierarchical tree with power blocks: the leaves are basic blocks; the internal nodes (Func, Loop, IF) are the power blocks, with hot blocks and hot power blocks marked.
(d) Final code: the sample code with add.off/add.on and mul.off/mul.on directives inserted at power block boundaries.]

Fig. 4. Introducing directives
– If a particular user function is called in a hot block such that the number of calls to the function exceeds the threshold, then the current framework bypasses the analysis of that function, on the grounds that any switching inside the function would be too frequent and hence not beneficial (it might in fact jeopardize execution speed).
– We compare each block with all its successors to check whether there is a difference in the power requirements of the blocks. If there is a difference, we try to generate off and on instructions at the boundaries after checking whether the blocks involved are hot or cold, following the strategy outlined earlier in this section.

When a hot power block is adjacent to cold blocks, off instructions are typically generated prior to entering the power block. From the requirements of the power block we identify the units to be turned on or off. Recall that the requirements of a power block are computed from the hot blocks in it. Therefore, within a hot power block there may be cold blocks that require a unit which is currently off; upon entry to such a cold block the unit is turned on, and upon exit it is turned off again. Notice that all the instructions being introduced are placed in cold blocks.

The code generated for our example is given in Fig. 4d. We assume that we have 2 integer adders and 2 integer multipliers (floating point units are omitted because we assume all operations in the code are integer operations). Note that at the beginning we turn all functional units off except one integer adder, which represents the minimal configuration for this example. The loop represents a hot power block, and the block preceding the loop is a cold block; therefore we introduce instructions according to the requirements of the power block prior to entering it. Since the hot basic block in the loop containing the statements "sum = sum + arr[i]" and "i++" requires two adders to exploit ILP, we turn on an additional adder before entering the loop. Notice that the multiplier (which we assume also performs the divide operation) is off in the loop. Therefore, if we enter the cold block containing the statement "sum = sum / 1000", a multiplier is turned on and upon exit it is turned off. Finally, prior to executing the library call print(count,sum), all off units are turned on; since at this point the adders are already on, only the multipliers need to be turned on.
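Since Fig. 4d itself is graphical, the following sketch is our reconstruction of the shape of the generated code from the description above; the directive placement follows the text, and the control flow is a linearization of the sample code:

add.off                 -- turn the second adder off (one adder stays on:
mul.off                 --   the minimal configuration)
mul.off
i = 0;
add.on                  -- the hot loop needs 2 adders to exploit its ILP
loop:
    if (sum < 1000) {
        sum = sum + arr[i]; i++;
    } else {
        mul.on          -- cold block needs the multiplier/divider
        sum = sum / 1000; count++; i++;
        mul.off
    }
    if (i < 100) goto loop;
mul.on                  -- library call: turn the off units back on
mul.on
print(count, sum);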
4 Experimental Results
Implementation. We have implemented and evaluated the techniques described in this paper. We used the lcc [3] compiler for our work; lburg was used to produce a code generator from compact specifications. The original code was executed on test data to generate the profile information that the compiler uses to place the on and off instructions. We use a cycle-level simulator generated using the FAST [8] system. FAST generates a cycle-level simulator, an assembler, and a disassembler from microarchitecture and instruction set specifications. In our experiments we simulated a superscalar processor that supports out-of-order execution and contains 2 integer adders, 2 integer multipliers, 1 floating point
adder, and 1 floating point multiplier. It uses control speculation (i.e., branch prediction) and implements a precise exception model using a reorder buffer and a future file. The number of outstanding branches is not limited, and branch mispredictions take a variable number of cycles to recover. We used six benchmarks in our experiments. From Mediabench [6] we took two programs: rawcaudio.c and rawdaudio.c. From DSPstone we took three programs: fir2dim.c, n-real-updates.c, and fir.c. The last benchmark, compress.c, is from SPEC95.

Removing idle time. To assess the effectiveness of our idle time removal technique we measured the utilization of functional units before and after optimization. We define utilization as the percentage of total program execution time (in cycles) for which the unit is on and busy executing instructions. In Table 2 we show the utilization of the various functional unit types in the processor; for the integer units the numbers represent the average utilization of the two units. As we can see, except for the integer adders, the units have very low utilization: while they are on, they are often not executing any operations. In other words, there must be periods during which these units can be turned off. After applying our techniques we measured the utilization again. As shown in Table 3, the utilization of the integer adders shows very little change; this is because during the execution of the optimized code these units were always on. For the other three types of units the utilization has become very high because they are busy executing operations while they are on. This means that for most of the time that they were idle, we were able to turn them off. In other words, these units were off for over 90% of the time for all programs except compress. Recalling the data in Table 2, we can see that turning units off for 90% of the time results in significant savings in static power dissipation.
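In symbols, the utilization of a functional unit u reported in Tables 2 and 3 can be written as

\[
\mathrm{Utilization}(u) \;=\; \frac{\#\{\text{cycles in which } u \text{ is on and executing an operation}\}}{\#\{\text{total execution cycles}\}} \times 100\%.
\]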
Table 2. Utilization of functional units in original code

                                 Utilization (%)
Benchmark          Integer Adder  Integer Mult  Float Adder  Float Mult
rawcaudio.c            87.71         0.0252          0            0
rawdaudio.c            88.76         0.00159         0            0
fir2dim.c              59.45         7.01            0            0
n-real-updates.c       61.62         2.37            0            0
fir.c                  52.26         2.65            0            0
compress.c             90.08         0.045          25.70        29.06
Performance degradation. We also measured the degradation in performance by comparing the total execution cycle counts of the original and optimized code (see Table 4). The degradation is less than 1%, due to the fact that we place
Table 3. Utilization of functional units in optimized code

                                 Utilization (%)
Benchmark          Integer Adder  Integer Mult  Float Adder  Float Mult
rawcaudio.c            87.73         99.7          99.7         99.7
rawdaudio.c            88.77         99.76         99.76        99.76
fir2dim.c              59.73         85.03         99.70        99.70
n-real-updates.c       61.62         94.53         99.44        99.44
fir.c                  52.72         92.42         99.38        99.38
compress.c             90.31         98.90         23.99        51.15
Table 4. Performance degradation

Benchmark          Unoptimized  Optimized  Degradation (%)
rawcaudio.c          6,588,776  6,591,742      -0.0147
rawdaudio.c          5,028,710  5,049,175      -0.0041
fir2dim.c                4,676      4,689      -0.28
n-real-updates.c         2,697      2,697       0
fir.c                    2,413      2,424      -0.46
compress.c             453,823    454,877      -0.232
the on and off instructions in cold blocks, and units are turned on upon decode of an instruction with the on suffix. The latter reduces stalling of instructions due to unavailability of functional units.

Transition activity vs. off durations. For each idle period during which a unit is turned off, we have a pair of transitions: on-to-off and then off-to-on. While the static power saved during the off periods depends upon the duration of those periods, the dynamic power spent during transitions depends upon the total number of transitions actually performed. Table 5 gives the total number of transition pairs for all functional unit types. There are no transitions for the integer adders because they are always on; for the integer multipliers the number given is the sum of the transitions encountered by both units of this type. These are the transitions actually performed during execution. Table 6 gives the average duration for which units were turned off. As we can see, these durations are quite long, ranging from several hundred to several thousand cycles. Since the durations for which functional units are off are quite long and the number of transition pairs is relatively modest, we can conclude that our approach is quite effective in saving the static power wasted by idle functional units.

Effectiveness of nullification strategy. We also measured the number of transition pairs that were nullified by our architecture design because they
Table 5. Non-nullified transition pairs

Benchmark          Integer Adder  Integer Mult  Float Adder  Float Mult
rawcaudio.c              0             769          748          735
rawdaudio.c              0             800          919          712
fir2dim.c                0               2            1            1
n-real-updates.c         0               2            1            1
fir.c                    0               2            1            1
compress.c               0             113          212          286
Table 6. Average off duration in cycles

Benchmark          Integer Adder  Integer Mult  Float Adder  Float Mult
rawcaudio.c              –            8552         8789         8944
rawdaudio.c              –            5481         6296         7075
fir2dim.c                –            3987         4674         4674
n-real-updates.c         –            2550         2682         2682
fir.c                    –            2230         2409         2409
compress.c               –           10929          496          847
Table 7. Nullified transition pairs

Benchmark          Integer Adder  Integer Mult  Float Adder  Float Mult
rawcaudio.c              0             445          148          148
rawdaudio.c           1510             298          149          149
fir2dim.c                0               0            0            0
n-real-updates.c         0               0            0            0
fir.c                    2               0            0            0
compress.c             958               0         1539            0
were too close together. The number of nullified transition pairs is given in Table 7. As we can see, this number is quite significant for some benchmarks, as they contain variable length idle regions which are often of small duration. Therefore our approach of letting the compiler aggressively remove idle time and then relying on the hardware to nullify the transitions when they are not useful has proven to be very successful.
5 Conclusions
The static power component of overall power dissipation is on the rise [2,9]. We presented a technique for reducing this static power by switching off idle functional units. Our approach uses a combination of compiler, instruction set, and microarchitecture support to maximize power savings while minimizing performance degradation. Static power reductions of over 90% were achieved for units that were found to be mostly idle, at the cost of well under a 1% increase in execution time.
References

1. D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A Framework for Architectural-Level Power Analysis and Optimizations. In International Symposium on Computer Architecture (ISCA), pages 83–94, Vancouver, British Columbia, June 2000.
2. J. A. Butts and G. S. Sohi. A Static Power Model for Architects. In IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 191–201, December 2000.
3. C. Fraser and D. Hanson. lcc: A Retargetable C Compiler: Design and Implementation. Addison-Wesley Publishing Company, 1995.
4. M. Horowitz, T. Indermaur, and R. Gonzalez. Low-Power Digital Design. In IEEE Symposium on Low Power Electronics, pages 8–11, 1994.
5. K. S. Khouri and N. K. Jha. Private Communication. June 2001.
6. C. Lee, M. Potkonjak, and W. H. Mangione-Smith. Mediabench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. In IEEE/ACM International Symposium on Microarchitecture (MICRO), Research Triangle Park, North Carolina, December 1997.
7. MIPS Technologies, 1225 Charleston Road, Mountain View, CA 94043. MIPS32 4K Processor Core Family, Software Users Manual, 1.12 edition, January 2001.
8. S. Onder and R. Gupta. Automatic Generation of Microarchitecture Simulators. In IEEE International Conference on Computer Languages (ICCL), pages 80–89, Chicago, Illinois, May 1998.
9. M. D. Powell, S-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. Gated-Vdd: A Circuit Technique to Reduce Leakage in Deep-Submicron Cache Memories. In ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), 2000.
10. K. Roy. Leakage Power Reduction in Low-Voltage CMOS Design. In IEEE International Conference on Circuits and Systems, pages 167–173, 1998.
11. S. Thompson, P. Packan, and M. Bohr. MOS Scaling: Transistor Challenges of the 21st Century. Intel Technology Journal, Q3, 1998.
12. V. Tiwari, R. Donnelly, S. Malik, and R. Gonzalez. Dynamic Power Management for Microprocessors: A Case Study. In International Conference on VLSI Design, pages 185–192, 1997.
13. V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, and F. Baez. Reducing Power in High-Performance Processors. In Design Automation Conference (DAC), pages 732–737, 1998.
14. Q. Wang and S. Vrudhula. Static Power Optimization of Deep Submicron CMOS Circuits for Dual VT Technology. In International Conference on Computer-Aided Design (ICCAD), pages 490–496, 1998.
Influence of Loop Optimizations on Energy Consumption of Multi-bank Memory Systems

Mahmut Kandemir1, Ibrahim Kolcu2, and Ismail Kadayif1

1 Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
[email protected]
2 Computation Department, UMIST, Manchester, M60 1QD, UK
[email protected]
Abstract. It is clear that automatic compiler support for energy optimization can lead to better embedded system implementations with reduced design time and cost. Efficient solutions to energy optimization problems are particularly important for array-dominated applications that spend a significant portion of their energy budget in executing memory-related operations. Recent interest in multi-bank memory architectures and low-power operating modes motivates us to investigate whether current locality-oriented loop-level transformations are suitable from an energy perspective in a multi-bank architecture, and if not, how these transformations can be tuned to take into account the banked nature of the memory structure and the existence of low-power modes. In this paper, we discuss the similarities and conflicts between two complementary objectives, namely, optimizing cache locality and reducing memory system energy, and try to see whether loop transformations developed for the former objective can also be used for the latter. To test our approach, we have implemented bank-conscious versions of three loop transformation techniques (loop fission/fusion, linear loop transformations, and loop tiling) using an experimental compiler infrastructure, and measured the energy benefits using nine array-dominated codes. Our results show that the modified (memory bank-aware) loop transformations result in large energy savings in both cacheless and cache-based systems, and that the execution times of the resulting codes are competitive with those obtained using pure locality-oriented techniques in a cache-based system.
1 Introduction
In programming for many embedded devices, one important aspect is to minimize the energy consumption. As off-chip main memories incur a significant energy and performance penalty when accessed, it is particularly important to perform user and/or compiler level optimizations to reduce energy consumption
and improve cache locality (if a cache exists in the system). While the impact of loop-level compiler optimizations on performance is well understood (e.g., see [12] and the references therein), very few studies (e.g., [1]) have tried to address the effect of these transformations on energy consumption. Investigating the energy impact of loop optimizations is important because it is the first step towards developing energy-oriented compiler optimizations.

Improving memory energy consumption is particularly important in embedded systems that execute image and video processing applications. These applications manipulate large arrays of signals using nested loops, and spend significant portions of their execution time executing memory-related operations [1]. The large off-chip memories that hold the arrays manipulated by these codes exhibit a high per-access energy cost (due to long bitlines and wordlines). A recent trend in memory architecture design is to organize the memory as an array of multiple banks (e.g., [11]) instead of the more traditional monolithic single-bank architecture. Each bank contains a portion of the address space and can be optimized for energy using an appropriate mix of low-power operating modes. More specifically, a bank not used by the current computation can be placed into a low-power operating mode. Also, using smaller banks helps reduce the per-access energy cost. Recent work has addressed how such low-power operating modes can be managed at the software [3,6] and hardware [3] levels. The impact of array placement strategies and two loop optimizations (loop splitting and loop distribution) on a banked off-chip memory architecture has been presented in [2].

The focus of this paper is on reducing the energy consumption of a multi-bank memory system without sacrificing performance significantly. In particular, we focus on array-dominated applications that can be found in domains such as embedded image/video processing and scientific computing, and investigate several loop transformation techniques to see whether they are successful in reducing memory system energy. We address the problem for both a cacheless system and a system with a cache memory. In a cacheless system (which is commonly used in real-time embedded applications), we study the energy impact of classical locality-oriented loop-level techniques and show that slight modifications to them can bring large energy benefits. In a cache-based system, we modify the data locality-oriented techniques to take into account the banked nature of the off-chip memory. To test our approach, we have implemented bank-conscious versions of three loop transformation techniques (loop fission/fusion, linear loop transformations, and loop tiling) using the SUIF compiler infrastructure [5], and measured the energy benefits using nine array-dominated codes. Our results show that the modified loop transformations result in large energy savings, and that the execution times of the resulting codes are competitive with those obtained using pure locality-oriented techniques.

The rest of this paper is organized as follows. Section 2 introduces the memory architecture assumed and reviews the fundamental concepts related to low-power operating mode management. Section 3 discusses the relationship between cache locality and memory energy consumption. Section 4 discusses the impact of three different loop-level transformations (iteration space tiling, linear loop
transformations, and loop fusion and fission) on memory energy, and explains how these optimizations can be modified to take into account the banked nature of the memory system. Section 5 presents experimental results showing the energy benefits of loop transformations. Section 6 concludes the paper with a summary.
2 Memory Architecture
In this work, we focus on an RDRAM-like off-chip memory architecture [11] in which the off-chip memory is partitioned into several banks, each of which can be activated or deactivated independently of the others. In this architecture, when a bank is not actively used, it can be placed into a low-power operating mode. While in a low-power mode, a bank typically consumes much less energy than in the active (normal operation) mode. However, when the bank is asked to service a memory request, it takes some time for the bank to come alive. The time it takes to switch to active mode (from a low-power mode) is called the resynchronization overhead (or reactivation cost). Typically, there is a trade-off between energy saving and resynchronization overhead: a low-power operating mode that saves more energy also has a higher resynchronization overhead. Thus, it is important to select the most appropriate low-power mode to switch to when a bank becomes idle. Note that different banks can be in different low-power modes at a given time.

In this study, we assume four different operating modes: an active mode (the mode during which memory read/write activity can occur) and three low-power modes, namely standby, napping, and power-down. Current DRAMs [11] support up to six power modes, with a few of them supporting only two modes. We collapse the read, write, and active-without-read-or-write modes into a single mode (called active mode) in our experimentation; however, one may choose to vary the number of modes based on the target DRAM architecture. The energy consumptions and resynchronization overheads for these operating modes are given in Figure 1. The energy values shown in this figure have been obtained from the measured current values associated with memory modules documented in memory data sheets (for a 3.3 V, 2.5 nsec cycle time, 8 MB memory) [10]. The resynchronization times (overheads) are also obtained from data sheets. Based on trends gleaned from data sheets, the energy values are increased by 30% when the module size is doubled.

An important parameter that helps us choose the most suitable low-power mode is the bank inter-access time (BIT), i.e., the time between successive accesses (requests) to a given bank. Obviously, the larger the BIT, the more aggressive the low-power mode that can be exploited. The problem of effective power mode utilization can then be defined as one of accurately estimating the BIT and using this information to select the most suitable low-power mode. This estimation can be done in software using the compiler [3,2] or OS support [6], in hardware using a prediction mechanism attached to the memory controller [3], or by a combination of both. While compiler-based techniques have the advantage of predicting
Mode          Energy Consumption (nJ)   Resynchronization Overhead (cycles)
Active                 3.570                           0
Standby                0.830                           2
Napping                0.320                          30
Power-Down             0.005                       9,000

Fig. 1. Energy consumptions (per access) and resynchronization times for different operating modes. These are the values used in our experiments.
BIT accurately for a specific class of applications, runtime and hardware-based techniques are better able to capture runtime variations in access patterns (e.g., those due to cache hits/misses). In this paper, we employ a hardware-based BIT prediction mechanism whose details are explained in [3]. The prediction mechanism is similar to the mechanisms used in current memory controllers. Specifically, after 10 cycles of idleness, the corresponding bank is put in standby mode. Subsequently, if the bank is not referenced for another 100 cycles, it is transitioned into the napping mode. Finally, if the bank is not referenced for a further 1,000,000 cycles, it is put into power-down mode. Whenever the bank is referenced, it is brought back into the active mode, incurring the corresponding resynchronization overhead (based on the mode it was in). We focus on a single program environment, and do not consider the existence of a virtual memory system. Exploring the (memory) energy impact of loop transformations in the presence of virtual address translation is part of our planned future research.
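Intuitively, transitioning a bank into a low-power mode m pays off only when the idle period t is long enough that the static energy saved, roughly (E_active − E_m) · t, exceeds the resynchronization cost of mode m; the graduated thresholds quoted above approximate this rule. A per-bank C sketch of the controller policy follows; the data structure and function names are our own, while the thresholds and modes are those described above:

typedef enum { ACTIVE, STANDBY, NAPPING, POWER_DOWN } BankMode;

typedef struct {
    BankMode mode;
    long     idle_cycles;   /* cycles since the last reference to this bank */
} Bank;

/* Invoked once per cycle for each bank. */
void bank_tick(Bank *b, int referenced) {
    if (referenced) {
        /* The resynchronization overhead of the current mode
           (0 / 2 / 30 / 9,000 cycles, Fig. 1) is charged here
           before the access can proceed. */
        b->mode = ACTIVE;
        b->idle_cycles = 0;
        return;
    }
    b->idle_cycles++;
    if (b->idle_cycles == 10)                       /* 10 idle cycles */
        b->mode = STANDBY;
    else if (b->idle_cycles == 10 + 100)            /* another 100 cycles */
        b->mode = NAPPING;
    else if (b->idle_cycles == 10 + 100 + 1000000)  /* a further 1,000,000 */
        b->mode = POWER_DOWN;
}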
3 Cache Locality vs. Off-Chip Memory Energy
Many optimizing compilers from industry and academia use a suite of techniques for enhancing data locality. Loop transformation techniques [12] are particularly important, as there is a well-defined data dependence and loop re-writing (code re-structuring) theory behind them and several efficient implementations exist. Almost all compiler-based locality-enhancing techniques take cache-specific parameters (e.g., size and associativity) into account; they also introduce some extra loop overhead and might degrade instruction cache performance (as they typically increase code size and reduce instruction reuse). If there is no cache in the memory hierarchy, it might not be advisable to employ locality-oriented loop transformations, as they bring no benefit and only increase loop execution overhead. However, if the memory system is partitioned into banks, applying loop transformations still makes sense (even in the absence of a cache), as we can cluster loop iterations (through loop transformations) such that the memory accesses in a given time period are localized to a small set of banks. This obviously allows the system to place more banks into low-power operating modes. One of the questions that we try to address in this
paper is whether the classical cache locality-oriented techniques are also suitable for optimizing memory energy in a cacheless multi-bank memory architecture, and if so, how they can be modified to extract the maximum energy benefit from the memory system.

The existence of a cache memory can, on the other hand, have an important impact on the energy consumption of a banked memory architecture. The cache can filter out many memory references and increase the bank inter-access times. This has two major consequences. First, the off-chip memory is accessed less frequently, and therefore consumes less energy. Second, more memory banks can be put in low-power modes and (in some cases) more aggressive low-power modes can be utilized. If the banked memory system has a cache, selecting a suitable combination and version of loop-level transformations becomes a much more challenging problem. This is because the two objectives, namely optimizing cache locality and minimizing off-chip memory energy, can sometimes conflict with each other (that is, they may demand different loop transformations and/or different parameters, e.g., tile size and unrolling factor, for the same set of transformations). In this case, one approach would be to optimize cache locality only and not to perform any banked-memory specific transformation. This strategy works fine as long as the cache captures the data access pattern successfully, that is, as long as the vast majority of data references are satisfied from the cache and do not go to off-chip memory. If this is not the case, then we need to take care of the off-chip references as well. We address this problem by modifying the cache locality optimization strategy to take into account the fact that, for the best off-chip energy behavior, data accesses should be clustered into a small set of memory banks. More specifically, we modify each type of loop transformation so that it becomes bank-conscious (bank-aware), as explained in the next section. One way of achieving this is to make sure that the transformed code accesses fewer banks than the original (unoptimized) code (even if all accesses miss the cache) and that the accesses are more clustered than in the original code. If this is not possible, then we try not to increase the number of banks that need to be activated (as compared to the original code).

In addition to evaluating the impact of loop transformations on the energy behavior of a cacheless memory architecture, this paper also experimentally evaluates two alternative schemes for optimizing energy and locality in a banked memory architecture with a cache. The first scheme optimizes only for cache locality, and the second scheme tries to strike a balance between enhancing cache locality and reducing off-chip memory energy, as explained above.
4 Energy Impact of Loop Transformations
In this section, we discuss how classical loop-based techniques developed for optimizing cache locality affect off-chip memory energy consumption. The conclusions we draw here are supported by the experimental evaluation in Section 5. As mentioned earlier in the paper, the optimizations considered in
this work include loop fusion/fission, iteration space tiling (loop blocking), and linear loop transformations.
4.1 Loop Fusion and Fission
Combining two loops into a single loop is called loop fusion. It is traditionally used to bring array references to the same elements close together [12]. Consider the following example, written using a C-like notation, which consists of two separate loops that access the same array a. It is easy to see that if the loop limit is sufficiently large that the array does not fit in the cache, this code will stream the array a from memory through the cache twice (once for each loop).

for (i = 0; i < N; i++)
    b[i] = a[i] + 1;
for (i = 0; i < N; i++)
    c[i] = a[i] * 2;
If this fragment is transformed into the form below, on the other hand, the array needs to be streamed through the cache only once, since its contribution to the second assignment can be calculated while the cache line holding a[i] is still cache resident from its use in the first assignment statement. This simple example illustrates that loop fusion can improve cache locality by bringing accesses to the same array closer together.

for (i = 0; i < N; i++) {
    b[i] = a[i] + 1;
    c[i] = a[i] * 2;
}
Unfortunately, the impact of loop fusion on off-chip memory energy is not as clear. If the loop nests to be fused contain extra arrays (i.e., arrays that are not targeted by the fusion), these arrays might lead to accesses to a large number of memory banks (some of which would not be accessed had we not fused the loops). Therefore, in a multi-bank memory architecture, loop fusion should be applied with care. One criterion in applying this optimization is to check whether fusing the loops would lead to the activation of more banks than the individual nests demand.

Loop fission (also known as loop distribution [12]) is the reverse of loop fusion: it places the statements of a given loop into separate loops, each with its own iteration space. One can expect this transformation to be useful from a memory energy viewpoint, in particular in cases where it separates the references to different arrays, thereby minimizing the number of banks that need to be activated for a given loop.

It is important to note the conflicting objectives of optimizing cache locality and optimizing memory energy when these transformations are employed. In general, when one wants to optimize data cache locality, loop fusion is preferable,
whereas loop distribution is generally used to enhance iteration-level parallelism by placing the sinks and sources of data dependences into separate loops. As far as memory energy optimization is concerned, however, loop fission is in general preferable, as it has the capability of isolating accesses to a small set of banks. For example, suppose that a loop nest accesses two different arrays a and b. Further assume that each array is accessed in a separate statement (in the loop body) and resides in a separate memory bank. If we do not perform loop fission, each iteration of the loop will access both banks, and the BIT for each bank will be too small to take any advantage of. If, on the other hand, loop fission is applied (provided that it is legal), each loop accesses a single bank. Since in this case the BIT for each bank is large, this may present more opportunities for placing banks into low-power modes.

Based on the discussion above, we propose the following strategy for applying loop fusion and fission in a banked-memory environment. If there is no cache in the memory hierarchy, then we do not apply loop fusion; we apply loop fission in such a way that the arrays that share the same set of banks reside within the same loop after fission. If there is a cache, we do not modify our loop fissioning strategy except that we do not separate statements that contain references to the same array (in an attempt to preserve cache locality). Delaluz et al. [2] present a loop distribution strategy for optimizing off-chip memory energy. Compared to that algorithm, the approach presented here does not try a subset of all possible fissioning alternatives (that is, it finds the solution in one shot), it is integrated with loop fusion, tiling, and loop permutation, and it tries to optimize cache locality and off-chip memory energy consumption in concert. Note that our fusioning/fissioning strategy tries to strike a balance between the two objectives. When applying loop fusion in a cache-based environment, on the other hand, we take cache considerations into account but never fuse two loops if doing so increases the number of banks accessed in a single iteration. For example, suppose that there are three one-dimensional fusable loops in the code, each with one statement within it: k1 += a[i]+b[i] in the first loop; k2 += a[i+1]*b[i-1] in the second loop; and k3 += c[i]-b[i] in the third loop. Also assume that each array is stored in a separate bank. In this case, while a pure cache locality-oriented approach would fuse all three loops (in conjunction with array padding), our bank-conscious approach would fuse only the first two loops, as sketched below. Note that, as in the case of loop fission, this loop fusion scheme also tries to find a balance between the conflicting objectives. To sum up, in a cache-based environment, we use cache constraints to restrict loop fission and banked-memory constraints (e.g., minimizing the number of active banks) to restrict loop fusion.
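Written out, the bank-conscious outcome for this example looks as follows (the loop bounds are illustrative):

/* One array per bank: a -> bank 0, b -> bank 1, c -> bank 2. */

/* Fused: both statements touch exactly banks {0, 1},
   so fusion does not widen the set of active banks. */
for (i = 1; i < N - 1; i++) {
    k1 += a[i] + b[i];
    k2 += a[i+1] * b[i-1];
}

/* Left separate: fusing this loop as well would keep banks
   {0, 1, 2} active on every iteration; kept apart, bank 0
   can sit in a low-power mode while this loop runs. */
for (i = 1; i < N - 1; i++)
    k3 += c[i] - b[i];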
4.2 Loop Tiling
A widely-used technique for improving cache locality is loop tiling [12]. Here, data structures that are too big to fit in the cache are broken up into smaller pieces that do fit. Consider the following matrix-multiply example.
If the arrays accessed in this nest do not fit in the cache, cache performance might be poor.

for (i = 0; i < L; i++)
    for (j = 0; j < L; j++)
        for (k = 0; k < L; k++)
            c[i][j] += a[i][k] * b[k][j];
If, however, this nest is tiled (blocked) as shown below (assuming that T divides L evenly, where T denotes the tile size), a square block of array c is computed by taking the product of a row-block of a with a column-block of b. This product consists of a series of sub-matrix multiplies. If these three blocks, one from each matrix, all fit in the cache simultaneously, their elements need to be read in from memory only once for each sub-matrix multiply. Thus, array a now needs to be touched only once for each column-block of c, and b needs to be touched only once for each row-block of a. As a result, the memory traffic is reduced by the size of the blocks.

for (ii = 0; ii < L; ii += T)
    for (jj = 0; jj < L; jj += T)
        for (kk = 0; kk < L; kk += T)
            for (i = ii; i < ii + T; i++)
                for (j = jj; j < jj + T; j++)
                    for (k = kk; k < kk + T; k++)
                        c[i][j] += a[i][k] * b[k][j];
While this transformation enhances temporal locality across multiple loop levels, it also modifies the array access pattern dramatically. For instance, after the transformation, at any given time a column-block of array b is active. Depending on the tile size parameter, a majority of these elements are not consecutive in memory (assuming a row-major memory layout). Consequently, all the banks that hold these elements need to be active during a given short period of time. This is, of course, assuming that the references to these elements go to off-chip memory and that the array is large enough. If there is a cache memory that captures these references successfully, then the impact of tiling on memory energy is expected to be positive (as it increases the bank inter-access times).

Our bank-aware tiling strategy works as follows. It first determines the loops that carry some form of data reuse, since tiling a loop that carries no reuse does not improve cache performance but increases loop overhead; we achieve this using a reuse-oriented tiling strategy. Then, among these loops (with data reuse), it selects a subset such that the resulting access pattern does not generate a data tile (i.e., data footprint) on the array space that is orthogonal to the storage direction of the array. This is because, under the assumption that the elements of a given array are stored consecutively in memory (from the first element to the last), a data tile orthogonal to the storage direction of the array leads to a maximum number of bank activations. For example, in the two-dimensional row-major case, the bank-aware tiling strategy never selects an iteration space tile shape if it leads to a column-block data tile on the
array space. If possible, it works with only row-block and square tiles. Note that, in the ideal case, one would want to work with only row-block data tiles; but in many cases, due to data dependences and array access patterns, it may not be possible to obtain only row-block tiles. Our experience and experiments show, however, that many nested loops can be tiled using only row-block and square tiles. To achieve this, when necessary, linear loop optimizations such as loop permutation can be applied prior to tiling. To sum up, our strategy first determines the loops with reuse, filters out the ones with footprints orthogonal to the storage order, and tiles the resulting nest. Our current implementation also tries all permutations of the outer loops (the innermost loop is determined by the linear loop transformations; changing the position of this loop during tiling may not be very beneficial) to obtain row-block and square tiles, i.e., to eliminate column-block tiles.
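The two footprint shapes can be contrasted on a simple reduction over a row-major array; the loops below are our own illustration:

/* Row-block data tile: restricting the row index i confines the
   footprint to T consecutive rows, i.e., contiguous memory,
   so only a few banks need to be active at a time. */
for (ii = 0; ii < L; ii += T)
    for (i = ii; i < ii + T; i++)
        for (j = 0; j < L; j++)
            sum += a[i][j];

/* Column-block data tile: restricting the column index j while i
   sweeps all rows touches T elements in every row. The footprint is
   orthogonal to the storage direction and can activate many banks;
   the bank-aware strategy rejects tile shapes with this effect. */
for (jj = 0; jj < L; jj += T)
    for (i = 0; i < L; i++)
        for (j = jj; j < jj + T; j++)
            sum += a[i][j];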
4.3 Linear Loop Transformation
Linear loop transformations that aim at improving cache locality generally try to achieve one of two objectives for each array reference: optimizing temporal locality in the innermost loop, or optimizing spatial locality in the innermost loop [7]. Optimizing temporal locality in the innermost loop allows the back-end compiler to place the reference in question into a register (provided that no alias exists). Note that this eliminates accesses to the cache and to memory, thereby increasing the memory idle time and creating more opportunities for employing low-power operating modes. Optimizing spatial locality (unit-stride accesses) is also beneficial from an energy perspective, as it allows all the accesses to a given bank to be completed before moving to another bank (provided that the array elements are stored sequentially). We note that there are cases where a linear transformation might be desirable from one objective's angle and not desirable from the other's. Consider the following nested loop, which accesses a two-dimensional row-major array:

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        sum += a[j][i];
Since the column-wise access pattern exhibited by the inner loop here is not suitable from a cache locality perspective, a solution is to interchange the order of the loops. Such an optimization makes the accesses in the inner loop consecutive in memory, and consequently improves data locality. Assuming that array a spans multiple banks, the loop interchange here is beneficial from an energy perspective as well (with or without a cache): after the interchange, the array is accessed sequentially, so the accesses to one bank are completed before moving to the next bank. However, if we assume that the entire array fits into a single bank, then an energy-oriented optimization strategy would not need to perform any transformation, as no transformation would have an effect on the inter-access time (BIT) of the bank in question. However, if there
is a cache in the system, then from a cache locality point of view it is still desirable to apply loop interchange. From the discussion above, we can conclude that linear loop transformations might be beneficial even if there is no cache in the banked-memory system.

Our bank-conscious linear loop transformation strategy works as follows. If there is no cache in the system, the compiler tries to optimize spatial and temporal locality aggressively; specifically, it uses the loop transformation framework presented in [7]. However, it does not apply a transformation if the transformation would neither reduce the number of banks active at a time nor cluster array accesses (e.g., when the array fits in a single bank). If there is a cache in the system, the compiler tries to optimize locality taking cache characteristics into account, and uses the fact that the memory is banked only when it needs to distinguish between references with no cache locality. For example, suppose that a nested loop that manipulates three arrays (a, b, and c) can be optimized for locality in two alternative ways (using linear loop transformations). In the first alternative, arrays a and b have unit-stride accesses, whereas array c has no cache locality. In the second alternative, arrays a and c have unit-stride accesses but array b has no cache locality. Our strategy then calculates how many different banks are accessed due to array c in the first alternative and due to array b in the second alternative, and selects the alternative with the minimum number of banks accessed. We have also experimented with an alternate strategy in which (when multiple optimization alternatives exist) the alternative that leads to the activation of the minimum number of banks (when all array accesses, optimized or unoptimized, are considered) is selected. Our experimental results indicate that, for the codes in our experimental suite, these two strategies generate very similar results. This is because, in general, the number of banks accessed is determined by the unoptimized array references.
4.4 Discussion
So far we have considered our optimizations in isolation. When we consider the interactions between these optimizations, the problem becomes much harder. In particular, the two objective functions, namely improving data locality and reducing off-chip memory energy, might demand different combinations of transformations. Consider the following nested loop, which accesses four different arrays:

for (i = 1; i < N; i++)
    for (j = 1; j < N; j++) {
        a[i][j] = a[i][j] + b[i][j];
        c[i][j] = c[i][j] * d[i][j];
    }
Let us assume that arrays a and b are stored in one bank, whereas c and d reside in another bank. A data locality optimization scheme would normally not perform any transformation on this loop, as all the references exhibit high spatial locality and the loop body is not large enough to justify loop distribution (due to
instruction cache locality concerns). A memory energy optimization strategy, on the other hand, will apply loop distribution to isolate the accesses to individual banks so as to maximize the idle periods of each bank. Now, let us assume that all the subscript expressions in the example above are [j][i] instead of [i][j] (under the same array placement scheme). In this case, a locality-oriented optimization strategy would apply loop interchange (i.e., swap the i and j loops) to obtain unit-stride accesses in the innermost position. A strategy that targets off-chip memory energy would, however, still use loop distribution. If the underlying architecture contains both a banked memory system and a cache, then it is best to apply both loop interchange and loop distribution. We can conclude from this example that the selection of loop transformations depends strongly on the data locality characteristics of the code as well as on the array allocation in off-chip memory (i.e., the array-to-bank mappings).

An important issue, then, is to combine our loop-based transformations in such a fashion that both off-chip energy and cache locality are optimized. However, combining loop-level transformations has not been easy in the past, even when one focuses only on specific types of transformations and on performance issues alone [12]. Our heuristic strategy is as follows. We first apply loop fission to isolate as many nested loops as possible; this enables the compiler to turn off as many memory banks as possible. After that, we apply the bank-conscious version of loop fusion to take advantage of the cache memory (if there is one in the system). Then, we consider each of the resulting nests one by one, and optimize it using the bank-conscious versions of loop permutation (linear transformation) and tiling. Figure 4 shows the overall optimization algorithm. This algorithm calls the algorithms Bank-Conscious-Fusion(.) and Bank-Conscious-Fission(.) given in Figures 2 and 3, respectively. The algorithm in Figure 2 is a greedy heuristic based on the depth of compatibility, similar to the performance-oriented fusioning strategy presented in [8]. It builds a DAG from the candidate loops, where the edges are dependences between the loops and the weight of each edge is the potential gain due to loop fusion. The nests are partitioned into compatibility sets at the deepest loop levels possible. The approach first fuses nests with the deepest compatibility and locality; the DAG is then updated and fusion is applied at the next level until all compatible sets have been considered. The algorithm in Figure 3, on the other hand, considers each nest one by one and applies loop distribution while being careful not to distort data locality. In both algorithms, for a given loop l, Arrays(l) gives the set of arrays accessed by it and Banks(Arrays(l)) gives the set of banks touched. After applying loop fission and fusion, within the outer for-loop (in Figure 4), each nest is optimized using loop permutation and tiling for off-chip memory energy and data locality.
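As an illustration of the fission step on the example above (with a and b in one bank, c and d in another), the distributed code would look roughly as follows:

/* After loop distribution, the first loop touches only the bank
   holding a and b ... */
for (i = 1; i < N; i++)
    for (j = 1; j < N; j++)
        a[i][j] = a[i][j] + b[i][j];

/* ... and while the second loop runs, that bank can be placed
   into a low-power operating mode. */
for (i = 1; i < N; i++)
    for (j = 1; j < N; j++)
        c[i][j] = c[i][j] * d[i][j];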
5 Experimental Evaluation
Our loop nest optimizer attempts to improve cache locality and off-chip memory energy consumption by performing high-level transformations on loops. The
Bank-Conscious-Fusion(N)
INPUT: N = N1, N2, ..., Ns, nests that are fusion candidates
ALGORITHM:
  build M = {M1, ..., Mt} where Mi = {mi} is a set of compatible nests
      with depth(Mi+1) ≤ depth(Mi);
  build DAG H with dependence edges and weights;
  for each Mi = {m1, ..., mp} do
    for k1 = m1 to mp do
      for k2 = m2 to k1 do
        if (no cache memory) then
          continue;
        else if ((there exists locality between k1 and k2) and
                 (Banks(Arrays(k1)) == Banks(Arrays(k2))) and
                 (it is legal to fuse k1 and k2)) then
          fuse k1 and k2 and update H;
        endif
      endfor
    endfor
  endfor
Fig. 2. Bank-conscious loop fusion algorithm

Bank-Conscious-Fission(N)
INPUT: N = N1, N2, ..., Ns, nests that are fission candidates
ALGORITHM:
  for each Ni = {n1, ..., nk}, where the nj are the individual loops in Ni, do
    let p1, ..., pl be the statements in Ni;
    for each nj ∈ Ni, j = 1, k do
      if (no cache memory) then
        distribute nj over nj+1, ..., nk, p1, ..., pl such that:
          if (Banks(Arrays(pk)) == Banks(Arrays(pj))) then
            pk and pj stay in the same loop after distribution;
          endif
      else
        apply the classical (performance-oriented) loop distribution
        algorithm such that:
          if (Banks(Arrays(pk)) == Banks(Arrays(pj))) then
            pk and pj stay in the same loop after distribution;
          endif
      endif
    endfor
  endfor
Fig. 3. Bank-conscious loop fission (loop distribution) algorithm
current implementation uses only the three optimizations (loop permutation, loop fusion/fission, and iteration space tiling) discussed earlier in the paper. The important characteristics of the benchmark codes that we used to measure the energy benefits of loop optimizations are given in Figure 5. fourier and flt are Fourier transform and digital filtering routines, respectively; adi and cholesky are ADI and Cholesky decomposition codes; hydro2d and nasa7 are array-dominated codes from the SPEC benchmark suite; and tis and tsf are from the Perfect Club benchmarks. Finally, nwchem is a kernel routine from a large real-life application that performs computational chemistry-specific calculations.
Bank-Conscious-Optimization(N)
INPUT: N = N1, N2, ..., Ns, nests in the procedure
ALGORITHM:
  Bank-Conscious-Fission(N);
  Bank-Conscious-Fusion(N);
  for each Ni = {n1, ..., nk}, where the nj are the individual loops in Ni, do
    best-cost = ∞; best-permutation = none;
    determine the permutations of n1, ..., nk with the best locality;
    let P1, ..., Pf be such permutations;
    for each Pi, i = 1, f do
      current-cost = the number of banks accessed by the arrays with no locality;
      if (current-cost < best-cost) then
        best-cost = current-cost; best-permutation = Pi;
      endif
    endfor
    determine the set Si of loops with reuse in Pi;
    if (there is a cache in the system) then
      tile each loop sj ∈ Si if its data footprint is not orthogonal
      to the storage direction;
    endif
  endfor
Fig. 4. Bank-conscious energy optimization algorithm
The third column in Figure 5 gives the total dataset size manipulated by the corresponding code. BaseE- and BaseE+ are the base energy values (without any loop optimizations) for a cacheless system and for a system with a 32KB two-way set-associative cache (with a block size of 32 bytes), respectively. Note that these base energy values have been obtained using the original codes while already exploiting the low-power operating modes (as explained in Section 2); in other words, our base version already takes advantage of the low-power operating modes. These energy numbers include the energy consumed in the off-chip memory (due to data accesses only) and the energy consumed in the data cache (when one exists). BaseT- and BaseT+ are the corresponding base execution times. The last three columns indicate whether a given benchmark is amenable to a specific optimization. All energy numbers given in Section 5.1 (resp. Section 5.2) are percentage improvements over the corresponding entry in the BaseE- (resp. BaseE+) column. All the energy numbers given in Figure 5 are in microjoules and have been obtained using the default memory bank configuration of eight 8MB banks (denoted 8×8MB). All performance numbers are in seconds.
5.1 Cacheless System
Figure 6 gives the percentage energy improvements for a cacheless system for four different versions. c-opt1, c-opt2, and c-opt3 denote the optimized versions assuming an imaginary cache of 8KB, 16KB, and 32KB, respectively (all caches are two-way set-associative with a block size of 32 bytes). The objective in measuring the energy behavior of these versions is to see whether we can
Benchmark  Number    Input   Base Energy          Base Performance   Optimization Applicability
Name       of Lines  Size    BaseE-    BaseE+     BaseT-   BaseT+    (fusion+fission, tiling, linear)
adi            56    78MB       28.9     19.3       5.76     3.92    √ √
cholesky       34    61MB       88.2     61.1       9.68     7.10    √ √
hydro2d        52    44MB      104.0     76.3      10.02     6.59    √ √
flt            85    51MB      723.3    328.1      16.81    11.57    √ √
fourier       167    57MB      634.0    411.7      11.96     8.90    √ √
nasa7       1,105    54MB    1,418.6    783.2      29.77    18.52    √ √ √
nwchem        370    44MB      780.5    408.9      13.95     8.16    √ √ √
tis           485    56MB      899.8    511.0      18.72    12.04    √ √ √
tsf         1,986    60MB    1,066.2    620.4      24.83    16.71    √ √ √
Fig. 5. Benchmark codes and their important characteristics
Fig. 6. Energy improvements in a cacheless system
use a cache locality-oriented scheme, without modification, for optimizing the memory energy of a banked system without a cache. The b-opt version, on the other hand, denotes a version that uses loop transformations solely for optimizing memory energy (i.e., the bank-aware version). We observe two important trends in these results. First, as the assumed cache size is increased, the energy benefits also increase. This is because with larger caches the locality-oriented strategy becomes less aggressive and performs fewer cache-specific optimizations, which in turn causes fewer side effects on memory energy consumption. Second, in a cacheless system, customizing loop optimizations to the banked nature of the memory pays off, improving energy consumption by 18.72% on average (compared to 13.20% for c-opt2). We should mention that increasing the assumed cache size further did not bring any additional improvement over c-opt3 (except for tis, where an assumed data cache size of 64KB reduced the memory energy by 2.8% over the c-opt3 version). Our experiments with different bank configurations also showed similar trends.
Fig. 7. Energy improvements for a memory system with cache
5.2 Memory System with Cache
Figure 7 presents the percentage energy improvements of three different versions for a banked memory system with a 32KB two-way set-associative cache memory. c-opt is the version that optimizes only for cache memory, and b-opt optimizes only for memory energy. The b+c-opt version, on the other hand, tries to strike a balance between the two objectives (optimizing cache locality and reducing off-chip memory energy). We can observe from this figure that, in general, c-opt generates better results than b-opt. That is, if there is a cache in the banked-memory system, it is not a good idea to use optimizations that target only memory energy; using pure locality-based optimizations results in better energy savings most of the time. However, we also observe that the b+c-opt version generates the best result across all applications (averaging a 22.84% overall energy improvement). Although not presented here due to lack of space, we observed similar trends in experiments performed using different cache sizes and associativities.
5.3 Performance Gains
Figure 8 gives the performance benefits (over the values given under the column BaseT+ in Figure 5) of three different versions (b-opt, c-opt, and b+c-opt) for a banked memory system with a 32KB two-way set-associative cache memory. We observe that the b+c-opt version generates comparable results to the c-opt version (pure locality-oriented approach). The difference between them is only 1.80%. Therefore, we can conclude that the combined optimization strategy is almost as good as the pure cache locality-oriented approach in improving the performance, but it leads to significantly more (memory system) energy savings than a pure locality-oriented approach.
Fig. 8. Percentage performance gains in a cache based system
6 Conclusions
In this paper, we investigate the influence of three different types of loop transformation techniques on memory system energy assuming a multi-bank memory architecture. A multi-bank memory system allows unused banks to be transitioned to low-power operating modes. In a multi-bank memory system without cache, we have found that slightly modified versions of classical locality-oriented loop transformation techniques generate large energy savings. In a cache-based multi-bank system, our results show that the modified (bank-aware) loop transformations result in large energy savings, and that the execution times of the resulting codes are competitive with those obtained using pure locality-oriented techniques.
Effective Enhancement of Loop Versioning in Java

Vitaly V. Mikheev, Stanislav A. Fedoseev, Vladimir V. Sukharev, and Nikita V. Lipsky

A. P. Ershov Institute of Informatics Systems / Excelsior, LLC, Novosibirsk, Russia
{vmikheev,sfedoseev,vsukharev,nlipsky}@excelsior-usa.com
Abstract. Run-time exception checking is required by the Java Language Specification (JLS). Though it provides higher software reliability, this mechanism negatively affects the performance of Java programs, especially computationally intensive ones. This paper studies loop versioning, a simple program transformation which often helps to avoid the checking overhead. Based on the Java Memory Model precisely defined in the JLS, the work proposes a set of sufficient conditions for the applicability of loop versioning. Scalable intra- and interprocedural analyses that efficiently check fulfilment of the conditions are also described. Implemented in Excelsior JET, an ahead-of-time compiler for Java, the developed technique results in significant performance improvements on some computational benchmarks.
Keywords: Java, performance, loop optimizations, ahead-of-time compilation
1 Introduction
To date, Java has become an industry-standard programming language. As Java bytecode [3] is a (portable) form of intermediate representation, a great wealth of dynamic and static optimization techniques has been proposed to improve the originally poor performance of Java applications. One of the reasons for the insufficient performance is the obligatory run-time checks for array element access operations. The JLS [2] requires two checks for a read/write of each array element a[i]: first, a must not be null (otherwise NullPointerException should be thrown), then i must be in the range 0<=i<a.length (otherwise IndexOutOfBoundsException should be thrown); writes to an array of a reference type may also require a type inclusion check. These checks become redundant, and may be removed, once the array reference is known to be non-null and the index expression is
proved to be in the proper range. Intraprocedural static analyses are able to infer such program properties quite effectively, of course, only if array objects are created and then used within the same Java method ([12], [13]). Unfortunately, the optimization of real Java programs often requires global analyses: it is enough to note that array references are typically stored in shared memory variables (static or instance fields). However, Java dynamic facilities such as the Reflection API, JNI, and dynamic class loading inhibit the applicability of such analyses, not to mention their high space and time complexity. An original technique called loop versioning was proposed to optimize loops without global flow analysis ([8]). The key idea is to keep a part of the checks in the resulting code but move them out of the loop body, as illustrated in Figures 1 and 2.
for(i=0; i<=ub; i++)
    chk_null(A)[chk_idx(A,i)] = 2*chk_null(B)[chk_idx(B,i+1)];
Fig. 1. Original loop code
if ((A!=null) && (B!=null) && (ub<A.length) && (ub+1<B.length))
    for(i=0; i<=ub; i++)
        A[i] = 2*B[i+1];
else
    for(i=0; i<=ub; i++)
        chk_null(A)[chk_idx(A,i)] = 2*chk_null(B)[chk_idx(B,i+1)];
Fig. 2. Versioned loop code

In this case, two copies (or versions) of a loop have to be generated. One copy is checks-free, provided all required conditions are tested before the loop. The other is the original loop with checks. The technique has a great advantage over static analysis: program properties which are extremely hard to analyze statically may now simply be checked at run-time, before loop execution. However, care must be taken to transform the original program correctly, as the array reference variables, the index expressions and the final value of the inductive variable have to be loop invariants. Thus, any loop versioning implementation should advocate (easily provable) conditions for the correctness of the optimization. We propose a simple and effective algorithm for loop versioning that can be used in production Java compilers. We implemented it in Excelsior JET [23], an ahead-of-time Java bytecode to native code compiler. The rest of the paper is organized as follows. Section 2 highlights certain aspects of the Java Memory Model with respect to the applicability of loop versioning. Sections 3 and 4 describe the program analysis and transformation required for the optimization. Section 5
outlines our implementation of loop versioning in the Excelsior JET optimizing compiler. The obtained results are presented in Section 6. Section 7 discusses related work and, finally, Section 8 concludes.
2 Java Memory Model
Let us consider the example in Figure 3. The question is: under which circumstances are A, B and UB loop invariants? If they are not, loop versioning may not be a correct transformation for such loops. This section helps answer the question. Note that we discuss the general case, in which the expressions may include not only locals but also static or instance fields.
for(i=0; i<UB; i++) {
    ... A[i] ... B[i] ...;
}
Fig. 3. Are A, B and UB loop invariants?

The "Threads and Locks" chapter of the JLS rigorously defines the Java Memory Model for (generally) multi-threaded programs with the help of three abstract machines: main memory, thread working memory and thread execution engine, as depicted in Figure 4. The main memory keeps track of the status of shared variables, performing the read/write actions. Each thread has a working memory, its own "local view" of the main memory. A thread working memory holds working copies of shared variables and communicates with the main memory through load/store message streams specific to each shared variable. Thread execution engines carry out Java code according to the language semantics and exchange data with working memory through use/assign message streams. The main concern of the specification is that reading and writing shared variables are non-atomic w.r.t. thread switching. For instance, the entire read-load-use action chain is not guaranteed to be executed in one time slice of a thread, though each of the actions is atomic by definition. The main rules related to shared variables are (the memory model has a richer set of restrictions; we describe only those useful for loop invariant computation):
1. A thread execution engine is free to use a working copy of a particular shared variable provided that copy was loaded from the main memory at least once.
2. If a thread updates a working copy through an assign action, subsequent use actions that occur in the thread should return the most recently assigned value.
Fig. 4. The Java Memory Model
In fact, the specification imposes a strict order on use and assign actions only, whereas other actions updating the main and working memory may be issued at any time, at the whim of the implementation (obviously, CPU architectures with many registers especially benefit from this memory model; although on Intel x86 shared variables are unlikely to be allocated to registers for a long time, an implementation may provide better cache behaviour if working copies are assigned to local temporaries). However, there are two exceptional cases in which working copies have to be in sync with the main memory: (i) A shared variable has the volatile modifier. In this case, a working copy should be synchronized with the main memory each time a use or assign action occurs in a thread. (ii) A synchronized block (or a call to a synchronized method) is present on the execution path. If so, all working copies should be synchronized with the main memory at the entry and exit of the block.
Proposition 1 ("Localization" of shared variables). Let a loop-carried statement include an expression with shared variables. If neither of the above conditions holds, the variables may be read before the loop and then treated as locals.
Thus, if the expressions to be proved loop invariants contain shared variables, our algorithm analyzes the loop body to check conditions (i) and (ii). If they are not fulfilled, the analysis concludes that the involved shared variables may be invariants; otherwise it makes the conservative assumption that they are not. Loop-carried calls are discussed in the next section.
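To make Proposition 1 concrete, here is a minimal Java sketch (our illustration, not code from the paper; the class and field names are invented). Since UB, A and B below are non-volatile and the loop body contains neither synchronized blocks nor calls, the compiler may treat the loop as if the shared fields had been read into working copies (locals) once, before the loop:

class Data {
    static int UB;                 // non-volatile shared variables
    static int[] A, B;

    static void original() {
        for (int i = 0; i < UB; i++) {
            A[i] = 2 * B[i];       // UB, A, B are conceptually re-read each iteration
        }
    }

    static void localized() {      // what the compiler may treat the loop as
        int ub = UB;               // working copies loaded once before the loop
        int[] a = A, b = B;
        for (int i = 0; i < ub; i++) {
            a[i] = 2 * b[i];       // references are now obviously loop invariant
        }
    }
}

The localized form is what makes the invariance of the loop bound and of the array references easily provable for the versioning conditions.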
3 Program Analysis
This section gives a set of sufficient conditions for the applicability of loop versioning and describes several analyses to effectively check them.
3.1 Alias Analysis
If the expressions to be proved loop invariants contain instance fields or array elements, the analysis has to infer that they are immutable in the loop body. The problem is that instance fields and array elements may be aliased in Java, so that writing one variable changes the value of the other. For example, the two expressions o1.f and o2.f are aliases if both o1 and o2 refer to the same object. In general, the property is practically undiscoverable at compile-time, even with (computationally hard) global flow analysis. Instead, our algorithm detects which expressions may not be aliases, employing the following simple criteria (the proposed criteria benefit from the strict type system and the absence of address arithmetic in Java):
1. Instance fields with different names may not be aliases (e.g. the expressions expr1.f and expr2.g may not be aliases).
2. Let us consider two expressions o1.f and o2.f. Let C1 and C2 be the static (declared) classes of objects o1 and o2. Furthermore, let SuperC1 (SuperC2) be the superclass of C1 (C2) in which instance field f was declared. The expressions o1.f and o2.f may not be aliases if SuperC1 and SuperC2 are different classes.
3. Let T1[] be the static (declared) type of array object a1 and T2[] be the static type of array object a2. The expressions a1[n] and a2[m] may not be aliases if either
   – at least one of the types T1, T2 is primitive and they are different types, or
   – the types T1, T2 are not interfaces (if at least one type is an interface, there may exist a class that implements it; in such a case, the non-aliasing property cannot be determined without a global type analysis) and neither can be cast to the other.
In other cases, the analysis conservatively concludes that the expressions may be aliases. Of course, the technique gives us correct but (generally) imprecise results. Nevertheless, we prefer to use it for its effective computability. For instance, the criteria for arrays do not work if a loop computes arrays of the int[] type and an invariant expression contains an access to another (immutable) integer array a[expr]. However, it works fine if the loop computes float[] or double[] arrays only.
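The criteria can be illustrated with a small Java sketch; the classes are hypothetical and serve only to show how the three rules classify field and array accesses:

// Hypothetical classes for illustrating the non-aliasing criteria.
class Base { int f; }
class Derived extends Base { int g; }
class Other { int f; }

class AliasExamples {
    void examples(Base o1, Derived o2, Other o3, float[] fa, int[] ia) {
        int x = o1.f + o2.g;     // criterion 1: different field names f and g,
                                 // so o1.f and o2.g cannot be aliases
        int y = o1.f + o2.f;     // o1.f and o2.f MAY be aliases: field f is
                                 // declared in the same superclass (Base)
        int z = o1.f + o3.f;     // criterion 2: f is declared in the unrelated
                                 // classes Base and Other, so no aliasing
        float w = fa[0] + ia[0]; // criterion 3: element types float and int are
                                 // different primitive types, so no aliasing
    }
}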
3.2 Loop Invariants Computation
For the sake of simplicity, the augment of the inductive variable is required to be a constant and is thus invariant. However, the following entities have to be proved loop invariants for correct application of loop versioning:
– the expression that denotes the final value of the inductive variable;
– the array references which are subjects for check removal;
– the index expressions, provided the inductive variable is fixed (to be more precise, if each occurrence of the inductive variable in an index expression is replaced with a constant, the expression becomes loop invariant).
First, our algorithm performs "localization" of shared variables as proposed in Section 2. Then it proves that the shared variables appearing on the left side of assignments may not be aliases of those which are part of the invariant expressions being analyzed. Finally, a traditional local flow-sensitive analysis [1] is employed to check whether the expressions are invariants. If any of these tests fails, versioning is not applied to the loop.
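A possible organization of these three steps is sketched below; the IR types and predicate names are our own stand-ins, not the actual interfaces of the Excelsior JET analysis:

// Sketch of the analysis pipeline; Loop, Expr and the abstract predicates
// are hypothetical compiler IR queries.
abstract class VersioningAnalysis {
    interface Loop {}
    interface Expr {}

    abstract boolean canLocalizeSharedVariables(Loop loop);       // Section 2
    abstract boolean mayBeAliasedByLoopStores(Loop loop, Expr e); // Section 3.1
    abstract boolean isLocallyInvariant(Loop loop, Expr e);       // local flow-sensitive analysis [1]

    boolean mayApplyVersioning(Loop loop, java.util.List<Expr> candidates) {
        if (!canLocalizeSharedVariables(loop)) return false;      // step 1
        for (Expr e : candidates) {
            if (mayBeAliasedByLoopStores(loop, e)) return false;  // step 2
            if (!isLocallyInvariant(loop, e)) return false;       // step 3
        }
        return true;
    }
}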
3.3 Checking Boundaries of Index Expressions
As long as the index expressions (with a fixed inductive variable) are loop invariants, they may be thought of as a set of functions {f_k(i) : [a..b] -> int}, where a and b are the initial and final values of the inductive variable i. For practical reasons, the algorithm recognizes only linear functions of the form f(i) = k*i + l, which attain their minimum and maximum at the margins of the domain range. Thus, for each array access arr[k*i+l], the compiler should emit a check of the following pre-condition formula (the sign of the inductive variable augment is known at compile-time; we give the formula for positive augments only, as it is symmetric for negative ones):
(k > 0) ? 0 <= k*a+l && k*b+l < arr.length
        : 0 <= k*b+l && k*a+l < arr.length
The compiler has to generate a concatenation of similar formulas for each index expression within the loop. Note that, as a rule, the resulting formula will be essentially reduced during the further local constant propagation and range analyses that our compiler performs.
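The following sketch shows how such a pre-condition could be assembled for one access arr[k*i+l]; it is only an illustration under the stated assumptions (compile-time constants k and l, known range [a..b]), not the compiler's actual code generator:

// Sketch: building the bounds pre-condition text for one access arr[k*i+l].
class BoundsPrecondition {
    static String precondition(String arr, long k, long l, long a, long b) {
        long atA = k * a + l, atB = k * b + l;
        // a linear function attains its extremes at the margins of [a..b]
        long min = (k > 0) ? atA : atB;
        long max = (k > 0) ? atB : atA;
        return "0 <= " + min + " && " + max + " < " + arr + ".length";
    }

    public static void main(String[] args) {
        // e.g. for the accesses A[i] and B[i+1] of Fig. 2 with i in [0..99]:
        System.out.println(precondition("A", 1, 0, 0, 99)); // 0 <= 0 && 99 < A.length
        System.out.println(precondition("B", 1, 1, 0, 99)); // 0 <= 1 && 100 < B.length
    }
}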
3.4 Handling Loop-Carried Calls
In general, a loop body may include calls to other methods among the operators in Fig. 3. Though called methods cannot access the locals of the caller, they may modify shared variables, contain synchronized blocks, or invoke yet other methods which do so. In order to get more precise results, our algorithm performs a simple interprocedural analysis to check the operators of the called methods (virtual method invocation, of course, hinders the analysis; our compiler accomplishes a local type propagation which often helps to "devirtualize" such methods, and if that is not possible, the analysis treats such methods as potentially unsafe and declines versioning if the invariant expressions contain shared variables). One might note that the same effect may be achieved by simply inlining such methods. However, care must be taken to prevent excessive inlining: not to mention the growth of the code size, it may result in decreased performance. Let us imagine a loop containing a call to quite a large method on a rarely executed branch. If the call is inlined, it may consume extra CPU registers and worsen instruction cache behaviour, as noted in [14]. Because our compiler framework is able to perform adaptive profile-based optimizations (including inlining), we prefer to implement scalable interprocedural analyses (where possible) rather than rely on increased inlining aggressiveness.
3.5 Complexity
The described algorithms scale linearly in the size of the program. The flow-insensitive analysis of loop-carried operators and the simple alias analysis give a complexity proportional to N (program size) + G (non-virtual call graph size). Thus, our algorithm runs in O(N + G) time and space. Strictly speaking, our compiler performs a flow-sensitive intraprocedural analysis that runs in O(n^2), where n is the number of local temporaries. As it takes effect during the compilation of each method anyway, it does not matter whether loop versioning is applied. This is why we give the "pure" complexity of the versioning analysis, not taking other local optimizations into account.
4 Program Transformation
If the described analyses have succeeded, the compiler can safely perform loop versioning. The necessary program transformation is very simple and includes the following steps:
1. Generation of pre-conditions for the nullness of array references
2. Generation of pre-conditions for index bounds
3. Replication of the loop body with removal of checks
The only important note is that the index bound checks must follow the nullness checks in the concatenated pre-condition formula. As the index check conditions dereference array variables (in the form expr < a.length), evaluating the nullness conditions first guarantees that these dereferences are performed on non-null references.
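A hedged Java sketch of the resulting shape of a versioned loop follows (mirroring Figs. 1 and 2; the method wrapper is ours). The short-circuit && chain is what makes the nullness tests guard the A.length and B.length dereferences performed by the bound tests:

class VersionedLoop {
    void versionedCopy(int[] A, int[] B, int ub) {
        if (A != null && B != null                      // nullness checks first
            && ub < A.length && ub + 1 < B.length) {    // then index bounds
            for (int i = 0; i <= ub; i++) A[i] = 2 * B[i + 1]; // checks-free version
        } else {
            for (int i = 0; i <= ub; i++) A[i] = 2 * B[i + 1]; // original version,
                                                               // with implicit run-time checks
        }
    }
}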
5 Implementation
This section highlights our implementation of loop versioning and particularly focuses on the benefits of ahead-of-time (static) compilation.
5.1 Excelsior JET
We implemented the described algorithm in Excelsior JET, a static compiler which converts Java bytecode to native (platform-specific) code before execution. JET is based on Excelsior's compiler construction framework, whose architecture is shown in Figure 5. The organization of the framework is similar to that of other known compilers, e.g. Marmot [9] and HPJC [10]. The main advantage of static compilation is that it is performed only once, on a developer's machine, and typically the majority of classes is known at compile-time (if a class is unknown beforehand, JET provides caching dynamic compilation through the Mixed Compilation Model [23], sacrificing several interprocedural optimizations in favour of lower resource consumption at run-time).
Fig. 5. Excelsior's compiler construction framework

Thus, the compiler is free to employ any time- and memory-expensive optimization technique, resulting in much better code quality than in the case of dynamic (just-in-time) compilation. In that sense, the JET abbreviation stands for Just-Enough-Time compiler. The Java "Write Once, Run Anywhere"(TM) paradigm is supported by providing static Java compilers for all major platforms, just as it is supported right now by providing a JVM for each of them. Currently, JET targets the Wintel platform; however, porting to other platforms is under consideration.
5.2 Implementation Notes
Now we describe certain aspects of the versioning implementation in Excelsior JET.
Decompilation of loop operators. The Java bytecode, which JET takes as its input language, contains stack-based VM instructions [3]. The JET bytecode front-end employs quite complex algorithms of abstract interpretation and symbolic computation to reconstruct (or decompile) structural operators. The reconstruction algorithms recognize loops with inductive variables and are thus not limited to for operators only; for instance, the loops might be written with the use of while or do-while operators in the original Java sources. In essence, the exploited algorithms are similar to those of related works [18], [19]. However, in order to make them work properly on a variety of real-world Java applications, we had to carefully adapt the algorithms to the Java bytecode specification.
Powerful local optimizations. As in most advanced compilers, the JET middle-end has a 3-address value internal representation and features local SSA-based optimizations [1]. If possible, checks are removed during local constant propagation and range analysis. However, along with CSE (Common Subexpression
Elimination), these optimizations are extremely useful for reducing the pre-condition formulas generated during the loop versioning transformation.
Generation of run-time checks. JET, as well as other compilers, co-operates with the run-time system to handle NullPointerException: in fact, a dereference of zero or another (small) value is treated as the exception. The Intel x86 assembly code for the checking instructions is presented in Table 1.
Table 1. Excelsior JET check instructions

Null check:
    // eax holds address of array
    cmp eax, [eax]

Combined null/index check:
    // eax holds address of array
    // ebx holds index value
    cmp [eax+arrLenOffset], ebx
    jbe IndexOutOfBound
Note that it would make little sense to remove null checks while preserving index checks, as both are performed at once by a single CPU instruction.
Adaptive optimizations. Our compiler framework supports profile-based optimizations. The collected profile often recommends not inlining particular methods because they are rarely executed. This is what led us to implement the interprocedural analysis of loop-carried calls.
Interprocedural analysis. In order to allow JET to perform interprocedural optimizations (e.g. escape analysis [7], inlining, etc.), we have implemented syntax tree object persistence, permitting arbitrary tree object graphs to be saved to and restored from files, or cached in memory if they are used intensively. This technique resembles the slim binaries approach proposed in [16] as an alternative to Java bytecode for dynamic compilation; however, we restrict its use to static code analysis and optimization only. The mechanism was simply recycled for analyzing loop-carried calls. Here we benefit from the ahead-of-time compilation approach, because most dynamic compilers cannot afford even simple interprocedural analyses due to time and memory limitations.
6 Experimental Results
This section gives the results we obtained on two series of benchmarks. One series is provided to discover the "pure" effect of versioning, when only array access operations are executed in loops. The other series consists of the well-known standard benchmark suites JavaGrande/EPCC Sequential 2.0 [20] and SciMark 2.0 [21]. All tests were run on the same system: an AMD Athlon(TM) running at
1400MHz/768MB RAM/Windows 2000 Professional. In order to see the best results that versioning may potentially give and the actual performance impact of the optimization, we provide the results for the following execution modes:
1. checks enabled, versioning disabled
2. both checks and versioning enabled
3. all checks disabled
6.1 Pure Effect of Versioning
Sum1, Sum5 and Sum10 are simple benchmarks which sum the contents of one, five and ten int[200000] arrays in the innermost loop during a number of iterations. Table 2 shows the execution time in seconds.
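The paper does not list the benchmark sources; a minimal sketch of what a Sum1-style kernel might look like (the array size comes from the text, the iteration count is assumed) is:

class Sum1 {
    static int[] a = new int[200000];

    public static void main(String[] args) {
        long total = 0;
        for (int iter = 0; iter < 1000; iter++) {  // iteration count assumed
            for (int i = 0; i < a.length; i++) {
                total += a[i];  // versioning hoists the null/index checks out of this loop
            }
        }
        System.out.println(total);
    }
}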
Table 2. Summing elements of large arrays

Benchmark   +checks -vers   +checks +vers   -checks
Sum1        1.07s           0.64s (-39%)    0.64s (-39%)
Sum5        2.89s           2.45s (-15%)    2.43s (-15%)
Sum10       7.55s           6.38s (-15%)    6.38s (-15%)
The best result (a 39% reduction of execution time) is achieved on the simplest benchmark. The smaller improvement on the others is caused by the CPU data cache behaviour, as those tests read elements of several very large arrays in the same loop. It is not surprising that benchmarks with versioning enabled are almost as fast as checks-free ones.
6.2 Standard Benchmarks
For our purposes, we selected only those benchmarks which perform array access operations in loops. The effectiveness of our versioning implementation is given in Table 3. For each test, it shows the total number of loops (counting only the loops with array access operations which are subject to versioning) and the number of those to which the optimization was applied. The benchmarks demonstrate 100% effectiveness, except JGFSeq3/Euler, which operates on int[][] arrays. We intentionally did not implement loop versioning for multidimensional arrays: flattening multidimensional arrays [17] is a different optimization, complementary to loop versioning. If a (rectangular) multidimensional array is flattened, the index expression a[i][j] is transformed to a[i*firstDimLength+j], which meets our versioning criteria. We plan to implement support for rectangular arrays in future versions.
Table 3. Effectiveness of loop versioning analysis

Benchmark                Num. of loops   Versioned loops
SciMark2/FFT             3               3 (100%)
SciMark2/Sparse matmult  2               2 (100%)
SciMark2/SOR             2               2 (100%)
SciMark2/LU              14              14 (100%)
JGFSeq3/Search           5               5 (100%)
JGFSeq3/MonteCarlo       2               2 (100%)
JGFSeq3/RayTracer        2               2 (100%)
JGFSeq3/MolDyn           1               1 (100%)
JGFSeq3/Euler            8               4 (50%)
Table 4 gives the performance improvement due to the application of loop versioning. Columns 2-4 hold the number of operations per second specific to each benchmark (a greater number means a better result).
Table 4. Performance improvement

Benchmark                +checks -vers   +checks +vers       -checks
SciMark2/FFT             194.4           196.1 (+0.8%)       196.8 (+1.2%)
SciMark2/Sparse matmult  170.8           212.6 (+24.4%)      246.9 (+44.5%)
SciMark2/SOR             331.8           331.7 (0%)          333.9 (+0.6%)
SciMark2/LU              266.2           312.7 (+17.4%)      323.0 (+21.3%)
JGFSeq3/Search           712513.2        713972.4 (0%)       755216.94 (+1.0%)
JGFSeq3/MonteCarlo       1733.73         1733.4 (0%)         1757.77 (+1.3%)
JGFSeq3/RayTracer        2706.92         2723.31 (+0.6%)     2729.92 (+0.8%)
JGFSeq3/MolDyn           180562.11       247306.3 (+36.9%)   251991.67 (+39.5%)
JGFSeq3/Euler            5.14            5.14 (0%)           5.71 (+11.0%)
As can be seen, versioning gives a performance improvement on the same tests as check disabling does. The performance gap between columns 3 and 4 on JGFSeq3/Euler is due to the not-yet-implemented compiler support for rectangular multidimensional arrays. The effect on SciMark2/Sparse matmult, however, is more subtle: the benchmark has three nested loops, and versioning the two innermost ones disrupts the (otherwise well-behaved) instruction cache because of the swollen code. We intend to ameliorate that as follows: given the assumption that the checked version is rarely executed, the compiler can move it to the
end of the method's code section and then link it with the main code through forward and backward jump instructions (some people from the compiler community wittily call this technique "siberian code sections"). One might be interested in a performance comparison between Excelsior JET, the most current Java VMs (e.g. Sun's HotSpot Server VM, which performs powerful SSA-based optimizations [4]) and other generally available ahead-of-time compilers for Java. Although independent studies (e.g. [22]) show that our compiler outperforms them on many benchmarks, including those cited in this paper, we do not give those results on purpose: that would require us to consider the entire variety of optimizations implemented in JET and the other compilers, not only the loop versioning we pursue in this paper.
7 Related Work
Byler et al. [8] seemingly pioneered the loop versioning optimization. Their work aimed at dealing with the lack of compile-time information when optimizing computational programs for parallel architectures. Versioning was proposed to detect alias-safe array regions thereby allowing concurrent computations. Pugh [15] gives an excellent description of the Java Memory Model in details and proposes further improvements to make more optimizations applicable to Java. The works [12], [13] pursue various static analyses for array checks removal. The proposed techniques are able to eliminate checks with the use of local (intraprocedural) optimizations. However, real-world Java programs are unlikely to create and use arrays within the same local scope of a method. Global flow-sensitive analyses are computationally hard. In general, any global analysis may not be used for Java due to presence of dynamic loading of classes and metaprogramming facilities. Artigas et al. [11] consider loop versioning for Java implemented in IBM High Performance Compiler [10]. Though the work mentions the possibility of loop versioning in Java, conditions of its applicability are not discussed. The main concern of their work is the use of versioning for parallel processing alias-safe array regions as in [8]. We find the optimization very useful even for single-processor architectures 12 and plan to implement it in future releases of Excelsior JET. Fitzgerald et al. at Microsoft Research, the authors of the Marmot ahead-of-time compiler for Java [9] do not regard the loop versioning optimization. The work [5] describing the architecture of the IBM Just-In-Time compiler for Java, most directly relates to ours. The authors propose an algorithm which relies on loop invariants, however, they do not describe invariant computation and alias analysis (if the last was employed). Because their work is an overview of the entire compiler architecture, it is hard to compare effectiveness of invariant computation which is not described in details. Moreover, their algorithm limits recognizable index expressions to i + constant whereas our algorithms permits expressions k ∗ i + l, where k and l are loop invariants not necessary constants. 11 12
Some people from the compiler community wittily call the technique ”siberian code sections”. The guarantee of alias-free arrays contributes to more effective code generation as many redundant load instructions may be eliminated.
This form of index expression (potentially) allows us to use versioning for rectangular multidimensional arrays as well. Finally, the algorithm employed by the IBM JIT does not perform versioning if the loop includes calls; our implementation makes a simple interprocedural analysis to handle loop-carried calls.
8 Conclusion
This paper presented loop versioning, a technique for the removal of null and index checks in Java programs working with arrays. A set of sufficient conditions for the correct application of loop versioning in Java was given. In order to check the conditions effectively, this work proposed algorithms for alias analysis and loop invariant computation which scale linearly in the size of the program. Implemented in Excelsior JET, an ahead-of-time compiler for Java, the developed technique results in significant performance improvements on computational benchmarks. An interesting area for future work is to provide the current implementation with support for alias-free array regions and rectangular multidimensional arrays.
Acknowledgements
Without the ongoing support of the entire Excelsior Java team, this work would not have been possible. A special thank-you to John O. Osbourne for his support.
References
1. S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, 1997.
2. J. Gosling, B. Joy, and G. Steele. The Java(tm) Language Specification, Second Edition. Addison-Wesley, Reading, 2000.
3. T. Lindholm, F. Yellin, B. Joy, and K. Walrath. The Java Virtual Machine Specification. Addison-Wesley, 1996.
4. The Java HotSpot(tm) Virtual Machine, Technical Whitepaper, Sun Microsystems Inc., 2001. URL: http://www.sun.com/solaris/java/wp-hotspot
5. Suganuma et al. Overview of the IBM Java Just-in-Time Compiler. IBM Systems Journal, Vol. 39, No. 1, 2000.
6. V. Mikheev. Design of Multilingual Retargetable Compilers: Experience of the XDS Framework Evolution. In Proc. of the Joint Modular Languages Conference, JMLC'2000, Volume 1897 of LNCS, Springer-Verlag, 2000.
7. V. Mikheev, S. Fedoseev. Compiler-Cooperative Memory Management in Java. To appear in Proc. of the 4th International Conference Perspectives of System Informatics, PSI'2001, LNCS, Springer-Verlag, 2001.
8. M. Byler et al. Multiple version loops. In Proc. of the 1987 International Conference on Parallel Processing, 1987.
9. R. Fitzgerald, T. Knoblock, E. Ruf, B. Steensgaard, D. Tarditi. Marmot: an Optimizing Compiler for Java. Microsoft Research, MSR-TR-99-33, 1999.
10. V. Seshadri. IBM High Performance Compiler for Java. AIXpert Magazine, September 1997.
11. P. Artigas, M. Gupta, S. Midkiff, and J. Moreira. Automatic Loop Transformations and Parallelization for Java. In Proc. of the International Conference on Supercomputing, ICS'00, 2000.
12. R. Bodik, R. Gupta, and V. Sarkar. ABCD: Eliminating Array Bounds Checks on Demand. In Proceedings of PLDI'00, 2000.
13. P. Pominville et al. A Framework for Optimizing Java Attributes. In Proc. Compiler Construction, CC'2001, Volume 2027 of LNCS, Springer-Verlag, 2001.
14. M. Arnold, S. Fink, V. Sarkar, and P. Sweeney. A Comparative Study of Static and Profile-Based Heuristics for Inlining. In Proc. of the ACM SIGPLAN 2000 Workshop on Dynamic and Adaptive Compilation and Optimization, DYNAMO'00, 2000.
15. W. Pugh. Fixing the Java Memory Model. In ACM 1999 Java Grande Conference, San Francisco, CA, June 1999.
16. M. Franz, Th. Kistler. Slim binaries. Technical report 96-24, Department of Information and Computer Science, UC Irvine, 1996.
17. J. Moreira, S. Midkiff, M. Gupta. A comparison of three approaches to language, compiler, and library support for multidimensional arrays in Java. In Proc. of the ISCOPE Conference on ACM 2001 Java Grande, 2001.
18. C. Cifuentes. Structuring Decompiled Graphs. In Proc. of the International Conference on Compiler Construction, CC'96, Volume 1060 of LNCS, Springer-Verlag, 1996.
19. U. Lichtblau. Decompilation of control structures by means of graph transformations. In Proc. of the International Joint Conference on Theory and Practice of Software Development, TAPSOFT'85, Volume 185 of LNCS, Springer-Verlag, 1985.
20. The Java Grande Forum Sequential Benchmarks, Version 2.0. URL: http://www.epcc.ed.ac.uk/javagrande/sequential.html
21. SciMark 2.0. Java benchmark for scientific and numerical computing. URL: http://math.nist.gov/scimark2/
22. O. P. Doederlein. The Java Performance Report - Part IV: Static Compilers, and More. JavaLobby, August 2001. URL: http://www.javalobby.org/fr/html/frm/javalobby/features/jpr/part4.html
23. Excelsior JET. Technical Whitepaper, Excelsior LLC, 2001. URL: http://www.excelsior-usa.com/jetwp.html
Value-Profile Guided Stride Prefetching for Irregular Code

Youfeng Wu (1), Mauricio Serrano (1), Rakesh Krishnaiyer (2), Wei Li (2), and Jesse Fang (1)

(1) Intel Programming Systems Research Lab
(2) Intel Compiler Lab
2200 Mission College Blvd, Santa Clara, CA 95052
{youfeng.wu,mauricio.serrano,rakesh.krishnaiyer,wei.li,jesse.fang}@intel.com
Abstract: Memory operations in irregular code are difficult to prefetch, as the future address of a memory location is hard for a compiler to anticipate. However, recent studies as well as our experience indicate that many irregular programs contain loads with near-constant strides. This paper presents a novel compiler technique to profile and prefetch for those loads. The profile captures not only the dominant stride values for each profiled load, but also the differences between the successive strides of the load. The profile information helps the compiler to classify load instructions into strongly or weakly strided and single-strided or phased multi-strided. The prefetching decisions guided by the load classifications are highly selective and beneficial. We obtain significant performance improvements for the CPU2000 integer programs running on Itanium machines. For example, we achieve a 1.55x speedup for "181.mcf", 1.15x for "254.gap", 1.08x for "197.parser", and smaller gains in other benchmarks. We also show that the performance gain is stable across profile data sets and that the profiling overhead is low. These benefits make the new technique suitable for a production compiler.
1 Introduction
Memory operations in irregular code are difficult to prefetch, as the future address of a memory location is hard for a compiler to anticipate. An example of irregular code is the "pointer-chasing" code that manipulates dynamic data structures. However, recent studies suggest that some pointer-chasing references exhibit near-constant strides; namely, the difference between two successive data addresses changes only infrequently at run-time. Stoutchinin et al. [23] and Collins et al. [4] notice that several important loads in 181.mcf of the CPU2000 integer benchmark suite have near-constant strides. Our experience indicates that many irregular programs, in addition to 181.mcf, contain loads with near-constant strides. For example, 197.parser in the CPU2000 integer benchmarks has code segments as shown in Figure 1 (a). The first load chases a linked list and the second load references the string pointed to by the
current list element. The program maintains its own memory allocation. The linked elements and the strings are allocated in the order in which they are referenced. Consequently, the strides for both loads remain the same 94% of the time with the reference input set. The CPU2000 integer benchmark 254.gap also contains near-constant strides in irregular code. An important loop in the benchmark performs garbage collection; a simplified version of the loop is shown in Figure 1 (b). The first load, at statement S1, accesses *s and has four dominant strides, which remain the same for 29%, 28%, 21%, and 5% of the time, respectively. One of the dominant strides occurs because of the increment at S4. The other three stride values depend on the values in (*s & ~3)->size added to s at S3. The second load, at statement S2, accesses (*s & ~3L)->ptr. This access has two dominant strides, which remain constant for 48% and 47% of the time, respectively. These strides are mostly affected by the values in (*s & ~3)->size and by the allocation of the memory pointed to by *s.

for (; string_list != NULL; string_list = sn) {
    sn = string_list->next;
    use string_list->string;
    other operations;
}
(a) Example from 197.parser
while ( s < bound ) {
S1:     if ((*s & 3) == 0) {
S2:         access (*s & ~3)->ptr
S3:         s = s + ((*s & ~3)->size) + values;
            other operations;
        } else if ((*s & 3) == 2) {
S4:         s = s + *s;
        } else { }
}
(b) Example from 254.gap
Figure 1. Irregular code with dominant strides
Many other CPU2000 integer benchmarks contain loads with near-constant strides as well. To illustrate the widespread occurrence of near-constant strides, we examine loads in the CPU2000 integer benchmarks with the following properties (in the following sections, these loads are referred to as candidate loads):
• Execute at least 2000 times.
• Occur inside loops with minimum trip counts of 100.
• Memory addresses are not loop invariant.
For each candidate load, we collect the top five most frequently occurring strides (including the stride value of zero), and count the number of dynamic references that have one of these strides. Figure 2 shows that, on average, about 73% of the dynamic references issued at a candidate load have one of the top five strides (see the bars marked Top5). For a few benchmarks, e.g. 255.vortex, almost all the references at a candidate load have one of the top five strides. Figure 2 also shows that the set of candidate loads accounts for about 10% of the total dynamic loads (see the bars marked Coverage). Traditional compiler techniques [23][3][17][13][15][20], however, cannot easily discover the near-constant strides in irregular code. Pointer references make it hard for a compiler to estimate the stride patterns of load addresses. Also, the near-constant strides are in many cases the result of memory allocation, and the compiler has limited ability to analyze memory allocation patterns. Without knowing that a load has near-constant
strides, it would be futile to insert stride prefetch instructions, as doing so will penalize those references with no regularity in strides.
Figure 2. Near-constant strides of candidate loads. (Bar chart: for each CPU2000 integer benchmark, 164.gzip through 300.twolf, and their average, the Top5 and Coverage percentages on a 0-100% scale.)
This paper presents a novel compiler technique called Value-Profile-Guided Stride-Prefetching (or VPGSP for short), which uses profile feedback to determine stride values for memory references and to insert prefetching instructions for those loads that can be effectively prefetched. The compiler first identifies a set of loads and instruments them to collect the Stride Value and Stride Difference profile (or SVSD profile for short). A stride value of a load is the difference between the load addresses in adjacent iterations of the loop containing the load. The stride difference is the difference between successive stride values. The compiler then performs program analysis using the SVSD profile to determine which loads have near-constant strides and inserts prefetching instructions for these loads. The compiler may also employ code analysis whenever possible to a) determine the best prefetching distance, b) reduce the profiling cost, and c) reduce the prefetching overhead. The example in Figure 3 illustrates the new prefetching technique. Figure 3 (a) shows a typical pointer-chasing loop. For simplicity, we assume that the load address of the reference P->data at L is P. The compiler instruments the load at L as shown in Figure 3 (b): the load address is passed to the profile routine to collect the stride value and stride difference profile. The profile could indicate that the load at L frequently has the same stride, e.g. 60 bytes. In this case, the compiler can insert prefetching instructions as shown in Figure 3 (c), where the inserted instruction prefetches the load two strides ahead (120 = 2*60). The compiler decides the number of iterations ahead using heuristics described in this paper. In case the profile indicates that the load has multiple dominant strides, the compiler may insert prefetching instructions as shown in Figure 3 (d) to compute the run-time strides before the prefetching: the variable prev_P stores the load address of the previous iteration, and the variable stride stores the difference between prev_P and the current load address P. Furthermore, if the profile suggests that a load has a constant stride, e.g. 60, some of the time and no stride behavior in the rest of the execution, the compiler may insert a conditional prefetch as shown in Figure 3 (e). The conditional prefetch can be implemented efficiently on Itanium using predication [323].
While (P)
L:  D = P->data
    Use (D)
    P = P->next

(a) A pointer-chasing loop

While (P)
    profile(P)
L:  D = P->data
    Use (D)
    P = P->next

(b) Stride profiling code

While (P)
    prefetch(P+120)
L:  D = P->data
    Use (D)
    P = P->next

(c) Single stride prefetching

prev_P = P
While (P)
    stride = (P - prev_P);
    prefetch(P + 2*stride)
L:  D = P->data
    prev_P = P;
    Use (D)
    P = P->next

(d) Multi-stride prefetching

prev_P = P
While (P)
    stride = (P - prev_P);
    if ( stride == 60 )
        prefetch(P+120)
L:  D = P->data
    prev_P = P;
    Use (D)
    P = P->next

(e) Conditional single-stride prefetching

Figure 3. Example of value profiling guided stride prefetching
This paper makes the following contributions.
• A new profiling and selective stride prefetching method is presented. The profile captures not only the dominant stride values for the profiled loads but also the changing characteristics between successive strides of the loads. The profile information helps the compiler to classify load instructions into strongly or weakly strided and single-strided or phased multi-strided. The prefetching decisions guided by the load classifications are highly selective and beneficial.
• Experiments with the new prefetching method show significant performance improvements for the CPU2000 integer programs running on Itanium machines. For example, the results show a 1.55x speedup for "181.mcf", 1.15x for "254.gap", 1.08x for "197.parser", and smaller gains in other benchmarks.
• The overhead of collecting the profile for stride prefetching is low. For example, there is only about a 1.3x slowdown using the reference input set with instrumentation for stride profiling. This overhead is smaller than that of collecting the block/edge profiles in many production compilers [1].
• Performance comparisons show that the stride prefetching technique is competitive with a hardware-prefetching scheme. This technique seems a viable alternative to hardware prefetching for saving processor cost/power while achieving comparable or higher performance.
• The SVSD profile is shown to be stable across input data sets. Experiments were performed to collect the SVSD profiles using the reference input set and the train input set. The performance difference between binaries compiled using these two profiles, running with the reference input set, is small.
The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 describes the prefetching algorithm. Section 4 provides the experimental results. Section 5 concludes the paper and points out some future research directions.
2 Related Work
2.1 Static Compiler Prefetching
There is extensive research on static compiler prefetching. The earlier work focuses on compiler-inserted prefetching instructions for array references [3][17][20]. Data reuse analysis is done to reduce the amount of redundant prefetching to the same cache line. Architectural features such as rotating registers and predication have been incorporated into data prefetching to reduce the overhead of the prefetching and the branch misprediction penalty [6]. Huang et al. [8] use a configurable memory controller to perform run-time data re-mapping and prefetching. Several recent studies focus on prefetching for recursive data structures. Lipasti et al. [13] use heuristics to prefetch for pointers passed into procedures. Luk and Mowry [15] examine several software techniques, including compiler-directed greedy prefetching, software full jumping, and data linearization. Roth and Sohi describe a framework for jump-pointer prefetching [19]. Karlsson et al. [12] extend jump-pointer prefetching with prefetch arrays, or arrays of jump pointers. Although these techniques have shown promising results on small benchmarks using hand-optimized code, designing compiler algorithms to automatically and beneficially perform the transformations could pose a serious challenge. The most relevant compile-time stride prefetching method is proposed by Stoutchinin et al. [23]. It uses compiler analysis to detect induction pointers and inserts instructions into user programs to compute strides and perform stride prefetching for the induction pointers. However, the compiler analysis cannot determine whether an induction pointer has a near-constant stride, and the prefetching instructions have to be inserted conservatively, e.g. only when machine resources allow them. Still, this technique can slow a program down when stride prefetching is applied to loads without near-constant strides. Although they showed a 20% performance gain for 181.mcf, they reported either very small (< 1%) or negative performance gains for the remaining CPU2000 integer benchmarks. In Section 4.2, we will compare VPGSP with a static prefetching implementation of the above technique.
2.2 Hardware Prefetch
One of the well-known hardware prefetching schemes is stream-buffer based prefetching [11][7][21]. These stream buffers can be viewed as additional caches with different allocation and replacement policies from the normal caches [18]. A stream buffer is allocated when a load misses both in the data cache and in the stream buffers. The stream buffers use stride [7] or history [21] information to predict the addresses to be prefetched. When free bus cycles become available, the stream buffers prefetch cache blocks. When a load accesses the data cache, it also searches the stream buffer entries in parallel; if the data requested by the load is not in the cache but is in a stream buffer, that block is transferred from the stream buffer to the cache. Two other hardware prefetching schemes are stride prefetching and sequential prefetching [5]. The stride prefetching scheme works as follows. The first time a load instruction misses in a cache, the corresponding instruction address I (used as a tag)
and the data address D1 are inserted into a reference prediction table, RPT. At that time, the state is set to 'no prefetch'. Subsequently, when a new read miss is encountered with the same instruction address I and data address D2, there will be a hit in the RPT if the corresponding record has not been displaced. The stride is calculated as S1 = D2 - D1 and inserted into the RPT, with the state set to 'prefetch'. The next time the same instruction I is seen with an address D3, a reference to D3+S1 is predicted, while the current stride S2 = D3 - D2 is monitored. If the stride S2 differs from S1, the state downgrades to 'no prefetch'. Hardware stride prefetching has the following limitations compared to VPGSP:
• The prefetching distance is the difference of the data addresses at two misses, which is somewhat arbitrary. This may cause either cache pollution, by unnecessarily prefetching too far ahead, or wasted memory traffic, by prefetching too short a distance.
• The hardware table is limited in size. For a program with many loads that miss the cache, the table may overflow and cause some of the useful strides to be thrown away, reducing the effectiveness of the prefetching.
• The hardware monitors cache misses at a particular cache level, e.g. L1, to determine prefetch strides. VPGSP is more flexible, as it can prefetch for different cache levels by using different prefetching distances.
We will compare the performance of VPGSP with that of hardware stride prefetching in Section 4.3.
2.3 Pre-computation
Recently, a number of studies propose to use speculative threads to run portions of the program in a separate thread to prefetch for the main program. For example, Collins et al. [4] identify delinquent loads and the backward slices for the loads during a previous simulation run. The profile is then used in a later run of the program to issue the slices speculatively before the delinquent loads are executed. C. Zilles and G. Sohi [25] use a run-time mechanism to predict which backward slices to run ahead. C. Luk [14] uses the compiler to insert code in user programs to fork speculative threads. With multithreaded resources, pre-computation can perform more complex address computation than stride calculation. However, all the pre-computation approaches so far resort to hand-coded program transformation, and the performance gains are measured on simulators; some of them report a performance gain only for 181.mcf out of all the CPU2000 integer benchmarks. Our approach does not require speculative multithreaded hardware, and the compiler performs stride prefetching automatically for the entire CPU2000 integer benchmark suite, with significant performance improvement on a real machine.
2.4 Value Profiling
A value profiling technique was proposed in [2]. It instruments user programs to collect the frequencies of the recurrent values for each profiling candidate. It uses a Least Frequently Used (LFU) replacement algorithm to manage a buffer to keep track
of the most frequently recurrent values. We extend it to profile both stride values and stride differences. We also devise a few techniques to improve the speed of the value profiler for collecting stride value and stride difference profiles.
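As an illustration of such an LFU-managed buffer, here is a sketch in the spirit of [2]; the class and its exact shape are our assumption, not the profiler's actual implementation:

import java.util.HashMap;
import java.util.Map;

// Top-N value table with Least-Frequently-Used replacement.
class TopNTable {
    private final int capacity;
    private final Map<Long, Long> freq = new HashMap<>();

    TopNTable(int capacity) { this.capacity = capacity; }

    void record(long value) {
        if (freq.containsKey(value) || freq.size() < capacity) {
            freq.merge(value, 1L, Long::sum);   // count a recurrent value
            return;
        }
        // buffer full: evict the least frequently used entry
        long lfuKey = 0, lfuCount = Long.MAX_VALUE;
        for (Map.Entry<Long, Long> e : freq.entrySet()) {
            if (e.getValue() < lfuCount) { lfuKey = e.getKey(); lfuCount = e.getValue(); }
        }
        freq.remove(lfuKey);
        freq.put(value, 1L);
    }

    Map<Long, Long> snapshot() { return new HashMap<>(freq); }
}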
3 Prefetching Algorithm
Here are some terms that are used in the description of the prefetching algorithm:
• Candidate load: a load that may miss the cache and should be considered for profiling or prefetching.
• Profiled load: a candidate load that is selected for profiling.
• Prefetched load: a candidate load that is selected for prefetching.
The prefetching algorithm is described in the following subsections.
3.1 Identify Candidate Loads
The compiler identifies a candidate load for stride profiling with the following criteria, using control flow profile information:
• The load is frequently executed.
• The load is inside a loop.
• The loop has a high trip count. For a loop with a very low trip count (e.g. 1), the compiler will consider the trip count of its parent loop, and the loads inside the loop will be prefetched as if they were in the parent loop.
• The memory address of the load is not a loop invariant.
The trip count condition above indicates that the load is likely to touch a large range of memory: the range [x, x + stride * trip_count] of the memory area will be touched. Therefore, the candidate loads selected by the above criteria are likely to miss the cache, especially when the SVSD profile shows that the stride value is large (see the sketch below).
3.2 Select Profiled Loads and Collect Profile
A set of loads is equivalent if their addresses differ only by compile-time constants. They will have the same stride values, or their strides can be derived from the stride of another load; the compiler therefore selects only one of them, as a representative, to be profiled. Examples of equivalent loads are:
• Loads that access different fields of the same data structure.
• Loads that access different elements of the same array.
For each profiled load, the compiler inserts profiling instructions to collect the SVSD profile. When the instrumented program is run, the profile runtime routine collects two types of information for the given series of addresses from a profiled load: a stride value profile and a stride difference profile.
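Returning to the criteria of Section 3.1, a minimal sketch of the candidate-load filter might look as follows; the LoadInfo record and the field names are ours, and only the constants 2000 and 100 come from the text:

class CandidateFilter {
    static class LoadInfo {
        long execCount;              // dynamic execution count from the control flow profile
        long loopTripCount;          // trip count of the enclosing (or parent) loop
        boolean inLoop;
        boolean addressLoopInvariant;
    }

    static boolean isCandidate(LoadInfo ld) {
        return ld.execCount >= 2000          // frequently executed
            && ld.inLoop                     // inside a loop
            && ld.loopTripCount >= 100       // high trip count
            && !ld.addressLoopInvariant;     // address varies across iterations
    }
}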
The stride value profile collects the top N most frequently occurring stride values and their frequencies. An example for N = 2 is shown in Figure 4 (a): for the nine stride values from the addresses of a profiled load, the profile routine identifies that the most frequently occurring stride is 2, with a frequency of 5, and the second most frequently occurring stride is 100, with a frequency of 4. The stride difference profile collects the top M most frequently occurring differences between successive strides and their frequencies. An example for M = 1 is shown in Figure 4 (b): for the eight stride differences, the profile routine identifies that the most frequently occurring difference is 0, with a frequency of 7. The stride difference profile is used to distinguish a phased stride sequence from an alternating stride sequence when they have the same stride value profile. The stride sequence shown in Figure 4 (a) is a phased stride sequence; an alternating stride sequence is shown in Figure 4 (c). A phased stride sequence is characterized by the fact that its top stride difference is zero. The alternating stride sequence in Figure 4 (c) has the same stride value profile as the phased stride sequence in Figure 4 (a); however, its top stride difference is not zero. A phased stride sequence is better for prefetching than an alternating stride sequence, as the stride values in a phased stride sequence remain constant over a longer period, while the strides in an alternating stride sequence change frequently. The value-profiling algorithm reported in [2] is used to collect the stride value profile. The same algorithm could also be used to collect the stride difference profile; however, we can simply count the number of zero differences between successive strides to obtain it. If the percentage of zero differences is high, we know that the stride sequence is phased.

Stride sequence: 2, 2, 2, 2, 2, 100, 100, 100, 100
Top[1] = 2, freq[1] = 5
Top[2] = 100, freq[2] = 4
Total strides = 9
(a) Stride values and top strides identified

Difference sequence: 0, 0, 0, 0, 98, 0, 0, 0
Dtop[1] = 0, freq[1] = 7
Total differences = 8
(b) Differential stride values and top difference identified

Stride sequence: 2, 100, 2, 100, 2, 100, 2, 100, 2
Difference sequence: 98, -98, 98, -98, 98, -98, 98, -98
Top[1] = 2, freq[1] = 5
Top[2] = 100, freq[2] = 4
Total strides = 9
Dtop[1] = 98, freq[1] = 4
Total differences = 8
(c) Same top strides but different top difference

Figure 4. Stride value profiling and stride difference profiling example
Stride prefetching often remains effective when the stride value changes slightly. For example, prefetching at address+24 and prefetching at address+30 should show little performance difference if the cache line is large enough to hold the data at both addresses. To account for this effect, the profile(address) routine treats strides that differ by less than half the size of a cache line as identical.
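As a concrete illustration, the following is a minimal sketch of such a per-load profiling routine (the names and the map-based bookkeeping are our own assumptions; the actual runtime uses the LFU scheme of [2], discussed in Section 4.6). Strides are quantized to half-cache-line granularity, and zero stride differences are counted to separate phased from alternating sequences:

    // Minimal sketch of a per-load SVSD profiling routine (assumed details).
    // Strides that differ by less than half a cache line map to the same bucket.
    class StrideProfileSketch {
        static final int HALF_LINE = 32;         // assumes a 64-byte cache line
        private long lastAddr, lastStride;
        private int seen;                        // 0: nothing, 1: addr, 2: addr+stride
        long totalStrides, totalDiffs, zeroDiffs;
        final java.util.Map<Long, Long> strideFreq = new java.util.HashMap<>();

        void profile(long addr) {                // called once per executed load
            if (seen > 0) {
                long stride = Math.floorDiv(addr - lastAddr, HALF_LINE);
                strideFreq.merge(stride, 1L, Long::sum);   // stride value profile
                totalStrides++;
                if (seen > 1) {                            // stride difference profile:
                    totalDiffs++;                          // only zero diffs are counted
                    if (stride == lastStride) zeroDiffs++;
                }
                lastStride = stride;
                seen = 2;
            } else {
                seen = 1;
            }
            lastAddr = addr;
        }

        // A sequence is "phased" when most successive strides are equal.
        boolean isPhased(double threshold) {
            return totalDiffs > 0 && zeroDiffs >= threshold * totalDiffs;
        }
    }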
3.3 Analyze Profile

The compiler reads the SVSD profiles to guide its prefetching decisions. It identifies the loads for stride prefetching by classifying the profiled loads into the following categories:
• Strong single stride load: a load with one non-zero stride that occurs with a very high probability (e.g. at least 70% of the time).
• Phased multi-stride load: a load with multiple non-zero strides that together occur a majority of the time, and whose stride differences are mostly zeros. For example, the profile may find that the stride values 32, 60, and 1024 together occur more than 60% of the time and that 50% of the stride differences are zeros.
• Weak single stride load: a load where only one of the non-zero strides is frequent (e.g. > 30% of the time) and where the stride differences are often zeros. For example, the profile may find that a load has one non-zero stride 35% of the time and that the stride differences are zeros 10% of the time.
In the first case, the compiler simply uses the most likely stride obtained from the profile in the prefetching instructions. In the second case, it uses run-time calculation to determine the strides. In the third case, it uses conditional prefetching.

3.4 Insert Stride Prefetching Instructions

For each set of equivalent candidate loads, although only one of them is profiled, more than one may need to be prefetched. To decide which ones to prefetch, the compiler analyzes the range of cache area accessed by the loads. Enough loads are prefetched to cover the cache lines in that range. The loads selected for prefetching are called the prefetched loads.

Assume a prefetched load has a load address P in the current loop iteration. If this is a strong single stride load and the dominant stride value is S, the compiler inserts the prefetch instruction "prefetch (P+K*S)" right before the load instruction, where K*S is a compile-time constant. The constant K is the prefetch distance determined by compiler analysis. If the load may have a miss latency of W cycles, and the loop body takes about B cycles without counting the miss latency of prefetched loads, then K = W/B. The cache miss latency can be estimated from the working set size of the loop. For example, if the estimated data working set size (trip_count * stride) of the load is larger than the size of the L3 cache (the highest cache level), the L3 cache miss latency is used as the value for W. The value of K may also be determined from the loop trip count as follows: K = min(trip_count / T, C), where T is the trip count threshold (e.g. 100) and C is the maximum prefetch distance (e.g. 8).

If this is a phased multi-stride load, the compiler inserts the following instructions:
1. Insert a move instruction before the load operation to save its address in a scratch register.
2. Insert a subtract instruction before the move instruction to subtract the value in the scratch register from the current address of the load. Place the difference in a register called stride.
3. Insert "prefetch (P+K*stride)" before the load, where K is determined as described previously, but rounded to a power of two to avoid the multiplication.
If this is a weak single stride load, the compiler inserts prefetching instructions following steps 1 and 2 described for a phased multi-stride load. Step 3 is modified as follows:
3'. Insert a conditional prefetch "if (stride == profiled stride) prefetch (P+K*stride)" before the load.
The conditional prefetch can be implemented on Itanium using predication. For example, the compiler can compute a predicate "p = (stride == profiled stride)" and insert a predicated prefetch instruction "p? prefetch (P+K*stride)". The purpose of the conditional instruction is to reduce the number of useless prefetches for a weak single stride load.
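To make the distance computation concrete, the sketch below restates the two rules for K and the three insertion schemes (the constants and helper names are illustrative; the paper does not give the compiler's actual interfaces):

    // Sketch of the prefetch-distance rules; T and C are the example values
    // from the text, and the method names are our own.
    final class PrefetchDistanceSketch {
        static final int T = 100;  // trip count threshold (e.g. 100)
        static final int C = 8;    // maximum prefetch distance (e.g. 8)

        // K = W / B: estimated miss latency W over loop body cycles B.
        static int byLatency(int w, int b) { return Math.max(1, w / b); }

        // The alternative rule: K = min(trip_count / T, C).
        static int byTripCount(long tripCount) {
            return (int) Math.min(tripCount / T, C);
        }

        // For run-time strides, K is rounded to a power of two so that K*stride
        // can be computed with a shift instead of a multiply.
        static int roundToPowerOfTwo(int k) {
            return Integer.highestOneBit(Math.max(1, k));
        }
    }
    // Inserted code by load class (pseudo-instructions, P = current load address):
    //   strong single stride: prefetch(P + K*S)                 // K*S is a constant
    //   phased multi-stride:  stride = P - saved; saved = P;
    //                         prefetch(P + (stride << log2(K)))
    //   weak single stride:   p = (stride == profiledStride);
    //                         p? prefetch(P + K*stride)         // predicated on Itanium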
4 Experimentation
VPGSP is implemented in a research compiler for the Itanium Processor Family (IPF). The compiler performs the profiling and prefetching automatically, with no hand coding involved. It is based on a production compiler, with additional components that make compiler and architectural exploration easier; the code it produces has base performance similar to that reported in [9]. The experiments use the integer benchmark suite of CPU2000 (see Figure 5) running on Itanium machines [10]. The Itanium machines used in these experiments have a 16K 4-way set-associative split L1 data cache, a 96K 6-way set-associative unified L2 cache, and a 2M 4-way set-associative unified L3 cache. Some of the experiments also use an Itanium machine with a 4M L3 cache.

In the experiments, the compiler uses the following criteria to classify loads for prefetching:
• Strong single stride loads: loads with one of the non-zero strides occurring at least 70% of the time.
• Phased multi-stride loads: loads with the top 4 non-zero strides together occurring at least 30% of the time, and with at least 30% of the differences between non-zero strides being zero.
• Weak single stride loads: loads with one of the non-zero strides occurring at least 20% of the time, and with at least 10% of the differences between strides being zero.

In this section, we first measure the speedup of VPGSP compared to 1) no prefetching, 2) static compiler prefetching, and 3) hardware prefetching. Then we report experimental results on the profiling overhead and the sensitivity of the profile to input data sets.
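Stated as code, these classification criteria amount to a small decision procedure over the SVSD profile (a sketch; the fraction parameters and names are our own assumptions):

    // Sketch of load classification from SVSD profile fractions.
    // topStride = frequency share of the most common non-zero stride,
    // top4 = combined share of the top 4 non-zero strides,
    // zeroDiff = share of zero stride differences.
    enum LoadClass { STRONG_SINGLE_STRIDE, PHASED_MULTI_STRIDE, WEAK_SINGLE_STRIDE, NOT_PREFETCHED }

    static LoadClass classify(double topStride, double top4, double zeroDiff) {
        if (topStride >= 0.70) return LoadClass.STRONG_SINGLE_STRIDE;
        if (top4 >= 0.30 && zeroDiff >= 0.30) return LoadClass.PHASED_MULTI_STRIDE;
        if (topStride >= 0.20 && zeroDiff >= 0.10) return LoadClass.WEAK_SINGLE_STRIDE;
        return LoadClass.NOT_PREFETCHED;
    }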
ID    Program   Language  Description
164   gzip      C         Compression/decompression
175   vpr       C         FPGA circuit placement and routing
176   gcc       C         C programming language compiler
181   mcf       C         Combinatorial optimization
186   crafty    C         Game playing: chess
197   parser    C         Word processing
252   eon       C++       Computer visualization
253   perlbmk   C         PERL programming language
254   gap       C         Group theory, interpreter
255   vortex    C         Object-oriented database
256   bzip2     C         Compression
300   twolf     C         Place and route simulator
Figure 5. CPU2000 integer benchmarks
4.1 Comparison with No Prefetching

Figure 6 shows the performance of VPGSP vs. no prefetching. For a 2M L3 machine, the 181.mcf benchmark is sped up by 1.55x and the 254.gap benchmark by 1.15x. For the 4M L3 machine, 181.mcf is sped up by 1.43x and 254.gap by 1.15x. Several benchmarks (e.g. 181.mcf, 197.parser, and 255.vortex) miss L3 frequently, and increasing the L3 size from 2M to 4M reduces the effectiveness of the prefetching. Figure 6 also shows that a few benchmarks, such as 164.gzip, have a slight performance loss with VPGSP. The prefetching instructions inserted by the compiler may increase the schedule length if the instruction scheduler cannot find enough available slots for them. (The schedule length measures the time to execute a program without considering cache misses and other microarchitecture stalls.) Figure 7 shows that, on average, about 3.6% more instructions are executed for prefetching, and that the prefetching instructions increase the schedule length by about 0.6%. If the prefetching instructions in a program do not reduce cache misses enough to compensate for the prefetching overhead (i.e. the increase in schedule length and the additional memory traffic), the program will show a performance loss.

4.2 Comparison with Static Prefetching

This experiment compares VPGSP with a state-of-the-art static prefetching implemented in a production compiler for IPF. The production compiler implements a number of prefetching techniques for array and irregular code [23][3][17][6][13][15][20]. In particular, it includes the technique proposed by Stoutchinin et al. [23] to detect induction pointers for stride prefetching. Compiler prefetching is very effective for numeric code [9][22]; for irregular code, however, its performance is mixed. Figure 8 shows the results on an Itanium machine with a 4M L3 cache. The bars marked "static prefetching" show the performance of the static prefetching. Some benchmarks have significant performance gains (181.mcf 9%, 254.gap 7%) and others have noticeable losses (197.parser 10%, 300.twolf 8%). The geometric mean is slightly negative.
[Figure 6 is a bar chart of per-benchmark speedups of VPGSP over no prefetching, with series "speedup 2M L3" and "speedup 4M L3"; the y-axis runs from 0.9 to 1.6.]

[Figure 7 is a bar chart of the per-benchmark "instruction count ratio" and "schedule length ratio" of VPGSP vs. no prefetching; the y-axis runs from 0.90 to 1.20.]
Figure 6. Performance comparison with no prefetching
Figure 7. VPGSP overhead in terms of instruction count and schedule length
Notice that VPGSP uses both the control flow profile (block frequencies and trip counts) and the SVSD profile to make prefetching decisions. The control flow profile could also benefit static prefetching. This experiment therefore further enhances the static prefetching with control flow profile information, namely by prefetching only loads that are frequently executed (e.g. at least 2000 times) and inside a loop with a high trip count (e.g. at least 100). This leads to significantly higher performance. The result is shown by the bars marked "static+cflow prefetching" in Figure 8. Still, the performance of the enhanced prefetching lags behind that of VPGSP (see the bars marked "vpgsp").

4.3 Comparison with Hardware Prefetching

To compare with hardware prefetching, a cache simulator is used to measure the miss rate reduction by VPGSP and by the hardware stride prefetching mechanism [5], on the same memory configuration as an Itanium machine with a 4M L3 cache.
[Figure 8 is a bar chart of per-benchmark speedups for "static prefetching", "static+cflow prefetching", and "vpgsp"; the y-axis runs from 0.90 to 1.40.]
Figure 8. Performance comparison with static prefetching
In this experiment, two sets of binaries are generated, one with and one without VPGSP prefetching. For the binaries with prefetching, the simulator treats the prefetch instructions inserted by the compiler as load references. For the binaries without prefetching, the simulator performs the hardware stride prefetch function. The hardware stride prefetcher monitors L1 misses and uses a 256-entry direct-mapped reference prediction table (as suggested in [5]). We compare the miss rates for the L1/L2/L3 data caches.

Figure 9 shows the miss rate reductions for the CPU2000 integer benchmarks with VPGSP and with the hardware scheme. VPGSP reduces the cache miss rates much more than the hardware scheme. For example, VPGSP reduces the L1, L2, and L3 miss rates for 181.mcf by 59%, 57%, and 23%, respectively, while the hardware scheme reduces them by 23%, 24%, and 3%. On average, VPGSP reduces the L1, L2, and L3 miss rates for the CPU2000 integer benchmarks by 16%, 16%, and 15%, respectively, while the hardware scheme reduces them by 12%, 10%, and 10%. There are a few cases in which the hardware stride prefetching reduces miss rates slightly more than VPGSP does, e.g. 164.gzip (L2), 176.gcc (L3), and 255.vortex (L1 and L2). We suspect that VPGSP and the hardware scheme prefetch slightly different sets of loads: VPGSP can support more aggressive program-specific prefetching using profile information, while the hardware dynamic logic can capture transient behavior for code not inside a loop or in loops with low trip counts. We will investigate a possible integration of the two in the future; for example, we may pass hints to the hardware about the loads that the compiler cannot prefetch effectively.

4.4 Sensitivity to Profiling Data Sets

This experiment shows that the performance improvement of VPGSP is stable across input data sets. SVSD profiles are collected with both the reference input set and the train input set, and the performance difference is measured between the binaries compiled using the two profiles, both running with the reference input set. Figure 10 shows that the performance gain with the profile obtained using the train
input set (train-ref) is only slightly lower than that obtained using the reference input set (ref-ref).
[Figure 9 is a bar chart of per-benchmark miss rate reduction for "vpgsp" and the hardware scheme ("hw") at each of L1, L2, and L3; the y-axis runs from -10% to 70%.]
Figure 9. Miss rate reductions by hardware stride prefetching and by VPGSP

[Figure 10 is a bar chart of per-benchmark speedups for the "ref-ref" and "train-ref" profiles; the y-axis runs from 0.9 to 1.6.]
Figure 10. Speedup with profiles from reference and train input sets
4.5 Classifications of Profiled Loads

This experiment gives the distribution of profiled loads over the following categories:
• SSST – strong single stride loads
• PMST – phased multi-stride loads
• WSST – weak single stride loads
Figure 11 shows that, on average, about 59% of profiled loads are "strong single stride", about 5% are "phased multi-stride", and another 2% are "weak single stride". Individual benchmarks show significantly different distributions. For example, the 254.gap benchmark has 45% of its profiled loads in the "phased multi-stride" category,
36% in the "weak single stride" category, and only 17% in the "strong single stride" category. As discussed in Section 3.4, each class of loads has a different prefetching code sequence. A benchmark like 254.gap needs all three types of prefetching code sequences to achieve the highest performance improvement.
[Figure 11 is a stacked bar chart of the per-benchmark distribution of profiled loads over %ssst, %pmst, and %wsst, from 0% to 100%.]
Figure 11. Distribution of profiled loads
4.6 Profiling Overhead

Finally, the execution time of the programs instrumented to collect the SVSD profile (profile_run) is compared with that of the programs without the instrumentation (base_run). Figure 12 shows the ratios of the execution time of the profile_run over the base_run with the same reference input set. The profile_run is only about 1.3x slower than the base_run.
[Figure 12 is a bar chart of the per-benchmark ratio of profile_run time to base_run time; the y-axis runs from 0.90 to 1.90.]
Figure 12. Profiling overhead
We achieve this low profiling overhead by selecting only a small percentage of the loads for profiling (about 10% of dynamic loads are profiled). We have also
employed a few techniques in the SVSD profiler to reduce the profiling overhead. Firstly, the value-profiling algorithm [2] invokes the heavy-duty LFU (least frequently used) replacement operation whenever the profiled value changes; treating strides that differ by half a cache line or less as identical reduces how often this happens, and thus reduces the profiling overhead. Secondly, the value-profiling overhead is proportional to the number of top values tracked by the LFU operation. In our implementation, only four non-zero top strides are tracked by the LFU operation; zero strides are checked and counted without going through the LFU operation.
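A sketch of the resulting table structure follows (our reconstruction; the exact replacement policy in the profiler may differ). Zero strides take a fast path, and the table tracks at most four non-zero top strides:

    // Reconstruction sketch: zero strides bypass the LFU table, which tracks
    // at most four non-zero top strides.
    class LfuStrideTableSketch {
        final long[] top = new long[4];    // tracked non-zero strides
        final long[] freq = new long[4];   // their frequencies
        long zeroCount;                    // zero strides never touch the table

        void record(long stride) {
            if (stride == 0) { zeroCount++; return; }          // fast path
            for (int i = 0; i < top.length; i++)               // already tracked?
                if (freq[i] > 0 && top[i] == stride) { freq[i]++; return; }
            int victim = 0;                                    // LFU replacement
            for (int i = 1; i < top.length; i++)
                if (freq[i] < freq[victim]) victim = i;
            top[victim] = stride;
            freq[victim] = 1;
        }
    }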
5 Conclusions and Future Work
In this paper, we have presented a novel profiling and prefetching method for guiding the compiler to perform selective stride prefetching. The profile captures not only the dominant stride values for the profiled loads, but also the rate of change between successive strides. The profile information helps the compiler classify load instructions as strongly or weakly strided, and as single-strided or phased multi-strided. The prefetching decisions guided by the load classifications are highly selective and beneficial. We show significant performance improvements from VPGSP for the CPU2000 integer programs running on Itanium machines; for example, we observe a 1.55x speedup for 181.mcf, 1.15x for 254.gap, and 1.08x for 197.parser. The performance gain from VPGSP is much higher than that obtained by static compiler prefetching or by hardware prefetching. Furthermore, the performance gain is stable across profiling data sets, and the profiling overhead is quite low. These benefits make the new technique suitable for a production compiler.

We are currently pursuing this work in the following directions:
• The SVSD profile is collected in a separate pass from the control flow profiling pass. We are investigating ways to implement the SVSD profiling algorithm so that it can run in the same pass as the control flow profiling, to simplify the software development cycle.
• We have observed cases where a load itself does not have a dominant stride, but its address depends on another load with constant strides. We may extend VPGSP to prefetch loads that depend on the results of the stride prefetching.
• Many applications maintain their own memory allocation. We need to investigate techniques to teach such customized memory allocators to produce more constant strides.
• A hybrid approach of software and hardware prefetching may lead to better performance. As described in Section 2.2, we may pass SVSD profile information to the hardware so that it prefetches only the loads that are not prefetched by software.
• The effectiveness of compiler prefetching can be improved through feedback of cache simulation information.
Acknowledgements

We would like to thank John Shen, Sun Chan, Dong-Yuan Chen, Yong-Fong Lee, and Hsien-Hsin Lee for their valuable comments. We appreciate the comments from the anonymous reviewers that helped improve the quality of the paper.
References

[1] Ball, T. and J. Larus, "Optimally profiling and tracing programs," ACM Transactions on Programming Languages and Systems, 16(3):1319-1360, July 1994.
[2] Calder, B., P. Feller, and A. Eustace, "Value Profiling," MICRO30, Dec. 1997.
[3] Callahan, D., K. Kennedy, and A. Porterfield, "Software Prefetching," ASPLOS4, 1991, 40-52.
[4] Collins, J., H. Wang, D. Tullsen, C. J. Hughes, Y. F. Lee, D. Lavery, and J. Shen, "Speculative Pre-computation: Long-range Prefetching of Delinquent Loads," ISCA28, 2001.
[5] Dahlgren, F. and P. Stenstrom, "Evaluation of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors," IEEE Transactions on Parallel and Distributed Systems, 7(4), April 1996.
[6] Doshi, G., R. Krishnaiyer, and K. Muthukumar, "Optimizing Software Data Prefetches with Rotating Registers," PACT 2001.
[7] Farkas, K., P. Chow, N. Jouppi, and Z. Vranesic, "Memory-system design considerations for dynamically-scheduled processors," ISCA24, June 1997.
[8] Huang, X., Z. Wang, and K. S. McKinley, "Compiling for the Impulse memory controller," PACT 2001, 141-150.
[9] Intel Corp., "Benchmarks: Intel Itanium based systems," http://www.intel.com/eBusiness/products/ia64/overview/bm012101.htm.
[10] Intel Corp., Intel Itanium Processor Hardware Developer's Manual, 2000. http://developer.intel.com/design/ia-64/manuals.htm.
[11] Jouppi, N., "Improving direct-mapped cache performance by the addition of a small fully associative cache and prefetch buffers," ISCA17, May 1990.
[12] Karlsson, M., F. Dahlgren, and P. Stenstrom, "A Prefetching Technique for Irregular Accesses to Linked Data Structures," HPCA6, January 2000.
[13] Lipasti, M. H., W. J. Schmidt, S. R. Kunkel, and R. R. Roediger, "SPAID: Software Prefetching in Pointer and Call Intensive Environments," MICRO28, Nov. 1995, 231-236.
[14] Luk, C., "Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors," ISCA28, 2001.
[15] Luk, C. K. and T. C. Mowry, "Compiler-Based Prefetching for Recursive Data Structures," ASPLOS7, September 1996, 222-233.
[16] Mahlke, S. A., D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann, "Effective Compiler Support for Predicated Execution Using the Hyperblock," MICRO25, Dec. 1992, 45-54.
[17] Mowry, T. C., M. S. Lam, and A. Gupta, "Design and Evaluation of a Compiler Algorithm for Prefetching," ASPLOS5, October 1992, 62-73.
[18] Palacharla, S. and R. Kessler, "Evaluating stream buffers as secondary cache replacement," ISCA21, April 1994.
[19] Roth, A. and G. Sohi, "Effective Jump-Pointer Prefetching for Linked Data Structures," ISCA26, June 1999, 111-121.
[20] Santhanam, V., E. Gornish, and W. Hsu, "Data Prefetching on the HP PA-8000," ISCA24, June 1997, 264-273.
[21] Sherwood, T., S. Sair, and B. Calder, "Predictor-Directed Stream Buffers," MICRO33, Dec. 2000.
[22] Standard Performance Evaluation Corporation, "All SPEC CFP2000 Results Published by SPEC," http://www.spec.org/osg/cpu2000/results/res2001q2/cpu2000-20010522-00663.html, 2001.
[23] Stoutchinin, A., J. N. Amaral, G. Gao, J. Dehnert, S. Jain, and A. Douillet, "Speculative Prefetching of Induction Pointers," CC 2001, April 2001. Also in LNCS 2207, 289-303, 2001.
[24] Vander Wiel, S. P. and D. J. Lilja, "When caches aren't enough: data prefetching techniques," Computer, 30(7), July 1997, 23-30.
[25] Zilles, C. and G. Sohi, "Execution-based Prediction Using Speculative Slices," ISCA28, 2001.
A Comprehensive Approach to Array Bounds Check Elimination for Java

Feng Qian, Laurie Hendren, and Clark Verbrugge*
School of Computer Science, McGill University
{fqian,hendren,clump}@cs.mcgill.ca

* Verbrugge's work done while at IBM Toronto Lab.
Abstract. This paper reports on a comprehensive approach to eliminating array bounds checks in Java. Our approach is based upon three analyses. The first analysis is a flow-sensitive intraprocedural analysis called variable constraint analysis (VCA). This analysis builds a small constraint graph for each important point in a method, and then uses the information encoded in the graph to infer the relationship between array index expressions and the bounds of the array. Using VCA as the base analysis, we also show how two further analyses can improve the results of VCA. Array field analysis is applied to each class and provides information about some arrays stored in fields, while rectangular array analysis is an interprocedural analysis to approximate the shape of arrays, and is useful for finding rectangular (non-ragged) arrays. We have implemented all three analyses using the Soot bytecode optimization/annotation framework, and we transmit the results of the analysis to virtual machines using class file attributes. We have modified the Kaffe JIT and IBM's High Performance Compiler for Java (HPCJ) to make use of these attributes, and we demonstrate significant speedups.
1 Introduction
The Java programming language is becoming increasingly popular for the implementation of a wide variety of application programs, including loop-intensive programs that use arrays. Java compilers translate high-level programs to Java bytecode, and this bytecode is either executed by a Java virtual machine (usually including a JIT compiler) or compiled by an ahead-of-time compiler to native code. In either case, the Java specifications require that an exception be raised for any array access in which the array index expression evaluates to an index out of bounds. A naive JIT or ahead-of-time compiler inserts checks for each array access, which is clearly inefficient. These checks cause a program to execute more slowly due to both direct and indirect effects of the bounds check. The direct effect is that the bounds check is usually implemented via a comparison instruction, so each array access carries this additional overhead. The indirect effect is that these checks also limit further optimizations such as code motion and loop
Verbrugge’s work done while at IBM Toronto Lab
transformations, because the Java virtual machine specification requires precise exception handling.

The problem of eliminating array bounds checks has been studied for other languages, and static analyses have been shown to be quite successful. However, array bounds check analysis in Java faces several special challenges. Firstly, the length of an array is determined dynamically, when the array is allocated, and thus the length (or upper bound) of the array may not be a known constant. Secondly, arrays in Java are objects, and these objects may be passed as references through method calls or stored as fields of other objects; thus, there may be a non-obvious correspondence between the allocation site of an array and the accesses to it. Thirdly, multi-dimensional arrays in Java are not necessarily rectangular, so reasoning about the lengths of higher dimensions is not simple. Finally, techniques that require transforming the program or inserting checks at earlier program points are not as applicable in Java as in languages with less strict exception semantics.

This paper describes a bounds check elimination algorithm which consists of three analyses: variable constraint analysis (VCA for short), array field analysis, and rectangular array analysis. The combination of these analyses can prove that many array references are safe, without transforming the original program. Variable constraint analysis builds a constraint graph for each array reference, and then uses the graph to infer the relationship between the index of the array reference and the array's length. The analysis was designed to take advantage of the fact that variables used in index expressions often have very short lifetimes; by building graphs only for the live variables of interest, the graphs are kept quite small. The associated work-list algorithm is also tuned to reduce the number of iterations. As a result, the actual running time is linear in the size of the method being analyzed.

Array field analysis is used to track the storage of array objects into class fields. By analyzing assignments to fields that have certain modifier restrictions (e.g., private and final), we are able to efficiently capture information about arrays that may not be locally allocated, but which still have limited scope. Finally, rectangular array analysis approximates the shape of multidimensional arrays. This analysis looks at the call graph of the whole application and identifies multidimensional array variables with consistent rectangular shapes. Both array field analysis and rectangular array analysis provide information consulted by VCA, and therefore improve its results.

All three analyses have been implemented using the Soot bytecode optimization framework [9], but could easily be implemented in other compilers with good intermediate representations. In order to convey the results of the analysis to virtual machines, we use the tagging/attributing capabilities of Soot to tag each array access instruction, indicating whether the lower bound and/or upper bound checks can be eliminated. The Soot framework then produces bytecode output, with the tag information stored in the attributes section of the class files. Virtual machines or ahead-of-time bytecode-to-native-code compilers can then use these attributes to avoid emitting bounds checks. We
have instrumented both the Kaffe JIT and the IBM HPCJ ahead-of-time compiler to read these attributes. We provide dynamic results showing the number of array bounds checks eliminated and the effect of the additional field and rectangular array analyses. We also provide runtime measurements demonstrating significant speedups for both Kaffe and HPCJ.

The remainder of the paper is structured as follows. The base VCA algorithm is presented in Section 2, and the two additional analyses are presented in Sections 3 and 4. Experimental results are given in Section 5, related work in Section 6, and conclusions in Section 7.
2 Variable Constraint Analysis
The objective of our variable constraint analysis is to determine the relationships between array index expressions and the bounds of the array. In Java, an array access expression of the form a[i] is in bounds if 0 ≤ i ≤ a.length − 1. If the array access expression is out of bounds, an ArrayIndexOutOfBoundsException must be thrown, and this exception must be thrown in the correct context.

Our base analysis is intraprocedural and flow-sensitive. For each program point of interest, we use a variable constraint graph (VCG) to approximate the relationships between variables. The VCG is a weighted directed graph, where nodes represent variables, constants, or other symbolic values, and edges have weights representing the difference constraints between the source and destination nodes. The interesting program points are the entry points of basic blocks. An array reference breaks a code sequence into two blocks, with the actual array reference starting the second block. The fundamental idea is that the entry of each basic block has a VCG reflecting the constraints among variables at that program point. These VCGs are approximated using an optimistic work-list-based flow analysis. By reducing the size of the graphs, carefully designing the work-list strategy, and appropriately using widening operators, we have developed an efficient and scalable analysis.

In the remainder of this section we introduce the concept of the variable constraint graph, which is the essence of our algorithm. We then describe the data-flow analysis and the techniques we used to improve the algorithm's performance.

2.1 The Variable Constraint Graph
Systems of difference constraints can be represented by constraint graphs and solved using shortest-path techniques [3]. We have adopted this approach for our abstraction. A node in a variable constraint graph represents the constant zero or a local variable of int or array type. A graph edge has a value of ⊥, an integer constant, or ⊤. For any constant c, the ordering ⊥ < c < ⊤ holds. A directed edge from node j to node i with a constant c represents the difference constraint i ≤ j + c.
The data-flow analysis uses constraint graphs to encode flow information. It needs to change the constraints between a set of variables after various statements; these changes are reflected by operating on the variable constraint graph. In the following, we define the operations (or primitives) applicable to our constraint graph.

Creating a graph: As we will see later, the set of vertices of a graph at an interesting program point can be pre-computed and never changes afterwards. There are no constraints between variables in the initial state. Thus, the initializing function accepts a set of vertices, and all edges are set to ⊥, meaning the edges are uninitialized.

Adding a constraint: A new constraint is added to a graph by changing the weight of the corresponding edge. In order to keep the tightest constraints possible, the edge is assigned the minimum of its old and new weight. This operation is named addedge in Table 2.

Deleting a constraint: When a constraint no longer holds, the corresponding edge weight is set to ⊤. Currently, a constraint is deleted only when detaching a node.

Detaching a node: When a variable is assigned a new value, its old edges must be removed before new ones are added. However, these edges may be part of paths connecting other nodes, and we wish to retain this information. Thus the detachnode primitive first builds edges from each predecessor to each successor, and then removes all in and out edges.

Updating a node's in and out edges: For an expression i = i + c, we do not kill the node i. Rather, all in-edge weights are increased by c, and all out-edge weights are decreased by c, to reflect the changed constraints. We call this operation update in Table 2.

Computing shortest paths: A constraint graph also provides methods to find the shortest path between two nodes or between all pairs, implementing the single-source and all-pairs shortest paths algorithms [3].

Merging two graphs: At confluence points we must merge constraint graphs coming from more than one predecessor. All predecessor graphs have the same set of nodes, but their edges may have different weights. Thus, merging graphs is done simply by merging edge weights. Note that, unlike adding a constraint, the merged edge weight is the maximum of the corresponding incoming edge weights.

Negative cycles: Negative cycles may exist in a constraint graph for programs with unreachable code due to useless branches. For example, if (i < j) { if (j < i) { P: ... } } leads to a negative cycle at program point P, but of course this point is never reached. In the presence of a negative cycle on a path, we cannot compute the shortest path weights for the nodes on the path; leaving them unchanged is conservative and preserves the correctness of the analysis.

Figure 1 shows an example of constraint graphs. We are interested in the graph before s3 because s3 has an array access and we want to know whether j is in bounds. The other two graphs only reflect the constraint changes.
(a) A basic block:

    s0: i = j + 2;
    s1: a[i] = ...;
    s2: i = ...;
    s3: a[j] = ...;

[(b)-(d) are drawings of the variable constraint graphs; edge lists reconstructed from the discussion below:]
(b) VCG before s1: edges j→i (2) and i→j (−2), encoding i = j + 2.
(c) VCG before s2: additionally a→i (−1) and i→0 (0), from the access a[i].
(d) VCG before s3: i's edges removed; bypass edges a→j (−3) and j→0 (2) preserve j − a ≤ −3 and 0 − j ≤ 2.
Fig. 1. The status of constraint graph changes

The statement s1 generates the constraints i − a ≤ −1 and 0 − i ≤ 0, resulting in edges from a to i and from i to 0. The path from a to j implies the constraint j − a ≤ −3, obtained by adding its edge weights. In statement s2, i loses its constraints from a and j, and the path a → i → j ceases to exist; the constraint is preserved, though, by a new edge directly from a to j with weight −3. Thus the constraint j − a ≤ −3 is still in effect before s3, even though i was redefined. The upper bound check for s3 can therefore be proved safe (we cannot derive a safe lower bound from this simple example, because it only implies 0 − j ≤ 2).
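The primitives of Section 2.1 can be summarized in the following sketch (an illustrative reconstruction in Java, not the actual Soot-based implementation; BOT and TOP play the roles of ⊥ and ⊤):

    // Illustrative reconstruction of the variable constraint graph primitives.
    class VCG {
        static final int BOT = Integer.MIN_VALUE;   // uninitialized (⊥)
        static final int TOP = Integer.MAX_VALUE;   // no constraint (⊤)
        final int n;       // nodes: array-related locals plus the constant 0
        final int[][] w;   // w[j][i] = c encodes the constraint i <= j + c

        VCG(int n) {
            this.n = n;
            w = new int[n][n];
            for (int[] row : w) java.util.Arrays.fill(row, BOT);
        }
        private static boolean finite(int c) { return c != BOT && c != TOP; }

        void addEdge(int j, int i, int c) {          // keep the tightest constraint
            if (!finite(w[j][i]) || c < w[j][i]) w[j][i] = c;
        }
        void detachNode(int x) {                     // bypass x, then drop its edges
            for (int p = 0; p < n; p++)
                for (int s = 0; s < n; s++)
                    if (p != x && s != x && finite(w[p][x]) && finite(w[x][s]))
                        addEdge(p, s, w[p][x] + w[x][s]);   // pred -> succ keeps path
            for (int y = 0; y < n; y++) { w[x][y] = TOP; w[y][x] = TOP; }
        }
        void update(int x, int c) {                  // models i = i + c
            for (int y = 0; y < n; y++) {
                if (finite(w[y][x])) w[y][x] += c;   // in-edges gain c
                if (finite(w[x][y])) w[x][y] -= c;   // out-edges lose c
            }
        }
        void mergeWith(VCG other) {                  // confluence: weaker constraint wins
            for (int j = 0; j < n; j++)
                for (int i = 0; i < n; i++)
                    w[j][i] = Math.max(w[j][i], other.w[j][i]);
        }
    }

Running all-pairs shortest paths over w (e.g. Floyd-Warshall, ignoring paths through negative cycles) then yields the tightest derivable constraints.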
2.2 Data-Flow Analyses
We developed two data-flow analyses for our intraprocedural algorithm. A special live-local analysis, which is relatively simple, determines the set of local variables related to array references. A more complicated analysis performs an abstract execution of the method, obtaining a conservative approximation of the constraints among the live locals. The first analysis limits the number of nodes in a constraint graph and therefore reduces the computation in the second analysis.

Array-Related Liveness Analysis

A variable constraint graph contains nodes for locals and edges between them. The size of the graph can be reduced by including only those locals that are used to compute an index or an array object length in the future. A smaller constraint graph allows faster computation of shortest paths, and may also reduce the number of iterations required for the fixed-point computation. In order to determine the nodes which should be in the variable constraint graph, we apply a special live-locals analysis which collects only those variables related to array references. As with ordinary liveness analysis, it is a backward flow analysis. Table 1 provides the key flow functions. The first column gives the types of statements or expressions that may generate or kill live locals. The second and third columns are used together: only when at least one of the locals in the condition set is live does the statement generate the live locals in the gen set. Note that array references generate live locals without any conditions. One can easily extend the liveness analysis to accommodate other special nodes, such as class fields, array elements, and common subexpressions.
Table 1. Liveness for array references

stmt/expr       cond    gen     kill
i = j + c       i       j       i
i = a.length    i       a       i
a = new T[i]    a       i       a
a[i]            -       a, i    -
if (i op j)     i, j    i, j    -
i = i + c       -       -       -
i = ...         -       -       i
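A sketch of the corresponding backward transfer function (the set-based encoding is ours): given the cond/gen/kill triple of Table 1 for a statement, the live set before the statement is computed as follows.

    // Sketch of one backward step of the array-related liveness analysis.
    // cond, gen, kill are the Table 1 entries for the statement; an empty
    // cond set (as for a[i]) means the gen set applies unconditionally.
    static java.util.Set<String> liveBefore(java.util.Set<String> liveAfter,
            java.util.Set<String> cond, java.util.Set<String> gen,
            java.util.Set<String> kill) {
        java.util.Set<String> in = new java.util.HashSet<>(liveAfter);
        in.removeAll(kill);                                  // definitions kill liveness
        boolean condHolds = cond.isEmpty()
                || cond.stream().anyMatch(liveAfter::contains);
        if (condHolds) in.addAll(gen);                       // gen only if cond is live
        return in;
    }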
Variable Constraint Analysis

We use a forward, flow-sensitive, optimistic data-flow analysis to approximate a variable constraint graph for each important point in a method body. The analysis is based on the control-flow graph of basic blocks, as explained before. The entry of each basic block is associated with an input VCG whose vertices are the array-related live locals. The initial state of each graph has ⊥ for all edges, except the entry point graph, which has ⊤ for all edges. The analysis is driven by a work-list algorithm which computes an output VCG from the input VCG and the effect of the statements in the basic block. A conditional branch statement may generate different constraints for the target block and the next block. After reaching a fixed point, the information for each array access statement S is encoded by the VCG associated with the basic block starting with S.

At any program point the set of interesting variables is known from the array-related liveness analysis. The abstraction computed by our analysis is the all-pairs shortest paths of a variable constraint graph. Instead of computing the shortest paths at every program point, however, we perform this computation only at confluence points; elsewhere, we perform simple operations on the graph. The abstract information that changes is the weights associated with the edges. For any constant c, the ordering ⊥ ❁ c ❁ c + 1 ❁ c + 2 ❁ . . . ❁ maxint ❁ ⊤ must hold.

The base analysis deals only with local variables, which cannot be aliased, nor can they be modified by method calls. Thus, the effect of each statement on a VCG is quite straightforward. The flow function for each kind of relevant statement is given in Table 2. Variables i, j, and a represent nodes in the graph, and c is an integer constant. Each graph has a 0 node. The first column shows the kinds of statements which have an effect on a VCG. The second column lists the constraints that can be generated from the statement. The third column shows the node whose constraints should be bypassed. The last column gives the operations performed on the constraint graph for the statement.
Table 2. Statements generating constraints

stmts           gen                  detach  operations
i = c           i − 0 ≤ c            i       detachnode(i); addedge(0, i, c);
                0 − i ≤ −c                   addedge(i, 0, −c)
i = j + c       i − j ≤ c            i       detachnode(i); addedge(j, i, c);
                j − i ≤ −c                   addedge(i, j, −c)
i = a.length    i − a ≤ 0            i       detachnode(i); addedge(a, i, 0);
                a − i ≤ 0                    addedge(i, a, 0)
a = new T[c]    a − 0 ≤ c            a       detachnode(a); addedge(0, a, c);
                0 − a ≤ −c                   addedge(a, 0, −c)
a = new T[i]    a − i ≤ 0            a       detachnode(a); addedge(i, a, 0);
                i − a ≤ 0                    addedge(a, i, 0)
a[i]            i − a ≤ −1           -       addedge(a, i, −1);
                0 − i ≤ 0                    addedge(i, 0, 0)
if (i < j)      target: i − j ≤ −1   -       addedge(j, i, −1)
                else:   j − i ≤ 0            addedge(i, j, 0)
i = j & c       i − 0 ≤ c            i       addedge(0, i, c);
                0 − i ≤ 0                    addedge(i, 0, 0)
i = i + c       -                    -       update(i, c)
i = ...         -                    i       detachnode(i)
At a confluence point P, we use the set of output graphs from the predecessors of P and the old input graph of P to compute the new input graph. We firstly call the merge operation to union all output graphs from the predecessors, then apply a special widening operation to each new edge weight by comparing it to the old edge weight. The widening operation looks at the trend of an edge weight: if the weight is increasing, we set it to ⊤ directly; if the new weight is less than the old weight, we discard the new weight and keep the old one. The widening technique speeds up the symbolic execution and also guarantees that the analysis terminates in the presence of loops.

Walking through a CFG in topological order can speed up data-flow analysis. However, a simple depth-first search (DFS) cannot guarantee an optimal order for the successors of a loop exit node; for our analysis, we prefer to visit the loop body before the loop exit. To enforce a good ordering, we first perform a DFS from the exit nodes of the CFG in reverse; the DFS from the start node can then consult this reversed order when it meets a loop exit, allowing us to place loop body nodes before loop exits. Our work-list algorithm puts the successors of a node whose out set changes onto the work-list for recalculation. The work-list is handled as a heap using the order computed as above. By enforcing this order we ensure that inner loops reach a fixed point before the outer loops. Experiments show this is a very effective way of making our data-flow analysis run efficiently.
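In terms of the VCG sketch from Section 2.1, the confluence step looks roughly as follows (copy() and the first-visit handling of ⊥ edges are assumed details):

    // Sketch of the confluence step: merge the predecessors' output graphs,
    // then widen each edge weight against the old input graph.
    static VCG confluence(java.util.List<VCG> predOuts, VCG oldIn) {
        VCG merged = predOuts.get(0).copy();          // copy() assumed on VCG
        for (int k = 1; k < predOuts.size(); k++)
            merged.mergeWith(predOuts.get(k));
        for (int j = 0; j < merged.n; j++)
            for (int i = 0; i < merged.n; i++) {
                int oldW = oldIn.w[j][i];
                if (oldW == VCG.BOT) continue;        // first visit: keep merged weight
                if (merged.w[j][i] > oldW)            // weight increasing: widen to TOP
                    merged.w[j][i] = VCG.TOP;
                else                                  // decreasing: keep the old weight
                    merged.w[j][i] = oldW;
            }
        return merged;
    }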
3 Array Field Analysis
The base analysis presented in the previous section does not handle arrays stored in fields. In Java applications, programmers may use fields to hold constant values for code modularity and clarity. A class field with the private or final modifier can only be assigned a value in the class declaring that field. Based on this observation, we developed a simple analysis detecting fields that hold array objects of fixed length.

For each class C, array field analysis examines the class fields. Let FC be the set of array-type fields declared in C and modified by private or final. If FC is non-empty, then a table τC is created, and for each f ∈ FC an entry τC[f] is created and initialized to ⊥. Each method m declared in C is then considered. Since the Soot framework provides typed locals, and ensures that a putfield or putstatic is always in the form of an assignment from a local to a field, a simple pre-scan of the types of the locals of m can be used to avoid further processing of methods that cannot change the value of any f ∈ FC.

For each method m that might change an array field, the body of m is scanned. Let f = ℓ be an assignment to some f ∈ FC. A value δ(ℓ) is computed as follows:
1. If ℓ is a newarray or multianewarray operation, extract the array length expression d and return δ(d).
2. If ℓ is a local variable, the UD-DU chains provided by the Soot framework are used to locate the definitions of ℓ. If ℓ has more than one definition point, return ⊤; otherwise, for a definition ℓ = x, return δ(x).
3. If ℓ is an integer constant c, return c.
4. Otherwise, return ⊤.
The table entry τC[f] is then updated by merging its existing value with the computed δ(ℓ) according to Table 3; note that δ(ℓ) is never ⊥.

When the intraprocedural VCA analysis meets an array-type field read of the form a = o.f, where o has class type C, it consults the array field analyzer to get
Table 3. The rule for updating the field table

old τC[f]   merge with δ(ℓ) = c2          merge with δ(ℓ) = ⊤
⊥           c2                            ⊤
c1          c1 if c1 == c2, else ⊤        ⊤
⊤           ⊤                             ⊤
the value τC[f]. If τC[f] has a constant value c, we can analyze this statement as if it were a = new T[c] (see the rule in Table 2). Our experience shows that this usually happens for a field with an initializer, where all assignments are made in the constructors. For simplicity, our implementation of array field analysis focuses only on the first dimension of array objects.
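A sketch of δ and the Table 3 merge follows (the toy IR types stand in for Soot's Jimple classes and UD-DU chains; null plays the role of ⊥):

    // Sketch of array field analysis; Expr, NewArray, LocalRef, and IntConst
    // are stand-ins for Soot's Jimple classes.
    final class ArrayFieldSketch {
        sealed interface Expr permits NewArray, LocalRef, IntConst {}
        record NewArray(Expr length) implements Expr {}                // newarray/multianewarray
        record LocalRef(java.util.List<Expr> defs) implements Expr {}  // defs via UD-DU chains
        record IntConst(int value) implements Expr {}

        static final int TOP = Integer.MIN_VALUE;  // sentinel for "unknown" (⊤)

        static int delta(Expr e) {
            if (e instanceof NewArray n) return delta(n.length());     // rule 1
            if (e instanceof LocalRef l)                               // rule 2
                return l.defs().size() == 1 ? delta(l.defs().get(0)) : TOP;
            if (e instanceof IntConst c) return c.value();             // rule 3
            return TOP;                                                // rule 4
        }

        // Table 3: bottom merged with c gives c; two different constants give TOP.
        static Integer merge(Integer old /* null = bottom */, int d) {
            if (old == null) return d;
            if (old == TOP || d == TOP || old != d) return TOP;
            return old;
        }
    }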
4 Rectangular Array Analysis
Another opportunity lies in rectangular arrays. Because multidimensional arrays in Java can be ragged, it is more difficult to get good bounds information for them. However, in scientific programs arrays are most often rectangular. Thus, we have developed a whole-program analysis, using the call graph, to identify rectangular arrays that are passed to methods as parameters.

A multidimensional array can be allocated by an explicit new instruction or by an array initializer. An initializer is compiled by javac or jikes as individual allocations, giving a potentially ragged array of array objects: an array of arrays is created, and then each element is assigned a subarray object. Figure 2(a) shows a typical Java example, and Figure 2(b) shows the resulting bytecode. We use a simple pattern matcher that finds this idiom and recovers the creation of a rectangular array, converting its sparse representation into a dense one, as shown in Figure 2(c).
(a) An array initializer:

    int[][] a = {{1}, {2}};

(b) Compiled code by javac and jikes:

    a = newarray (int[])[2];
    $r2 = newarray (int)[1];
    $r2[0] = 1;
    a[0] = $r2;
    $r3 = newarray (int)[1];
    $r3[0] = 2;
    a[1] = $r3;

(c) Recovered code:

    a = multianewarray int[2][1];
    $r2 = a[0];
    $r2[0] = 1;
    $r3 = a[1];
    $r3[0] = 2;
Fig. 2. Recover the creation of rectangular arrays

After finding all the creation sites for rectangular arrays, we perform a simple whole-program analysis to find which variables must be associated with rectangular arrays. To achieve this we build an array type propagation graph. The graph nodes consist of two special nodes, TRUE and FALSE, plus nodes representing method parameters, locals, return values, class fields, and array elements. To minimize the size of the graph, we include nodes only for those variables whose static types indicate multidimensional array objects.

A variable in the graph is connected to the TRUE node if it is assigned a new multi-array expression, a = multianewarray T[i][j]. A variable a is connected to the FALSE node if it appears in a statement a[i] = c and a is a
multidimensional array. An assignment a = b adds an edge between a and b. To handle assignments due to parameter passing, we add edges between actual arguments and formals for each method call. For virtual and interface calls we use a conservative call graph to find all potential target methods. If a local is passed to, or receives the return value of, a method outside our analysis context (i.e. we do not have the method body to examine), we make a conservative assumption and connect the variable to the FALSE node.¹

After building the propagation graph, we want to find all nodes which are reached starting at the TRUE node (were allocated as rectangular) and are not reached starting at the FALSE node (may have become ragged). We achieve this as follows. First we traverse the graph, starting from the FALSE node, marking the visited nodes as reachable from FALSE. Then we traverse the graph starting at the TRUE node, finding all reachable nodes that are not marked FALSE. This set contains the members that are always assigned rectangular arrays.

To use the rectangular array information, the constraint graph has some special nodes to represent the subarrays; for example, a special node represents the second dimension length of A.
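A sketch of the final reachability step over the propagation graph (adjacency-list encoding assumed; edges are stored symmetrically, since assignments connect both directions):

    // Sketch: a variable is rectangular iff it is reachable from TRUE
    // but not from FALSE in the array type propagation graph.
    static java.util.Set<Integer> rectangularNodes(
            java.util.List<java.util.List<Integer>> adj, int trueNode, int falseNode) {
        java.util.Set<Integer> taintedByFalse = reach(adj, falseNode); // may be ragged
        java.util.Set<Integer> rect = reach(adj, trueNode);            // allocated rectangular
        rect.removeAll(taintedByFalse);
        rect.remove(trueNode);
        return rect;    // members are always assigned rectangular arrays
    }

    static java.util.Set<Integer> reach(java.util.List<java.util.List<Integer>> adj, int start) {
        java.util.Set<Integer> seen = new java.util.HashSet<>();
        java.util.ArrayDeque<Integer> work = new java.util.ArrayDeque<>();
        seen.add(start);
        work.push(start);
        while (!work.isEmpty())
            for (int next : adj.get(work.pop()))
                if (seen.add(next)) work.push(next);
        return seen;
    }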
5 Experimental Results
We have implemented the algorithm in the context of the Soot framework. In this section we present and discuss the experimental results we have obtained. The results are grouped into three categories:
1. We measure the dynamic characteristics of the variable constraint analysis in terms of the two most important factors affecting the algorithm's performance: the size of the variable constraint graphs and the number of iterated blocks needed to reach the fixed point.
2. We then show the results of the base intraprocedural analysis, followed by the array field analysis and the rectangular array analysis added in separately, and finally combined. The results are presented as percentages of lower and upper bound checks that can be proved safe.
3. Our analysis results are encoded in the attributes of class files. To measure the real impact on the run-time performance of Java programs, we modified the Kaffe JIT and the HPCJ compiler to read and take advantage of these attributes. The run-time measurements show speedups for most of the benchmarks.

We chose several benchmarks, both general and numerical: the Spec benchmarks and scimark2, as well as LCS, an implementation of a longest common subsequence algorithm, and MCO, an algorithm for finding an optimal order of matrix multiplications. Before running the experiments, we measured the overhead of array bounds checks within each benchmark. In the Spec benchmarks we found that 'mpegaudio' has a
¹ For our experiments, we analyzed only the benchmark code, and treated the library as out of context.
Table 4. Characteristics of the algorithm

Benchmark    Graph size (avg)  Graph size (max)  Blocks  Iter (avg)  NonZero Blocks
db           3.17              6                 280     1.28        89
jack         2.5               6                 2076    1.04        1892
javac        2.45              6                 3347    1.27        1631
mpegaudio    3.42              10                6987    1.10        6670
raytrace     2.56              6                 626     1.31        476
scimark2     5.8               12                388     1.79        301
LCS          9                 13                59      2.8         55
MCO          4.6               11                98      2.0         95
large overhead, as do LCS, MCO, and three sub-benchmarks in scimark2. These are all typical examples of array-intensive programs. The other benchmarks in our study serve as examples of normal programs that are less array-oriented.

5.1 Dynamic Characteristics of the Algorithm
Table 4 shows some of the dynamic properties of our algorithm applied to the different benchmarks. The Blocks column gives the number of basic blocks in the program, while the NonZero Blocks column gives the number of blocks that have non-empty live sets for local variables, and therefore non-empty constraint graphs. Only NonZero blocks were used in the calculation of the average and maximum constraint graph sizes, and every (non-empty) constraint graph includes at least one node, for the constant zero. The size of the constraint graphs is thus quite reasonable: the average size never exceeds 10 nodes, and the maximum is no more than 13. These are quite practical sizes. The Iter column is the average number of times a block is processed as the analysis iterates toward a fixed point. It is a good indicator of how long the analysis will run, and suggests that in a practical sense the running time of our algorithm is linear in the code size. There is an impact due to loop nesting: in the small benchmarks LCS, MCO, and scimark2, the code bodies are dominated by nested loops, and hence the factor is higher than in the other benchmarks. Nevertheless, the factor remains relatively small.

5.2 Dynamic Results and Discussion
Figure 3(a) shows the percentage of bounds checks our basic intraprocedural analysis is able to prove safe to remove. Note that these are dynamic statistics, obtained by instrumenting the class files and inserting profiling instructions before each array reference bytecode. Lower bounds and upper bounds are measured separately in the first two bars for each benchmark, while the last bar gives the percentage of array references with both checks safe.
[Figure 3(a) is a bar chart of per-benchmark percentages (0%-100%) of "Safe lower bound", "Safe upper bound", and "Safe both bounds" for db, jack, javac, mpegaudio, raytrace, FFT, LU, SOR, LCS, and MCO.]
(a) Results of the base analysis

[Figure 3(b) is a bar chart of per-benchmark percentages (0%-100%) with both bounds safe, for VCA, VCA+Field, VCA+Rect, and All.]
(b) Improvements due to field and shape analysis (both bounds safe)
Fig. 3. Dynamic Results of VCA
The intraprocedural algorithm can determine that a fairly high percentage of the lower bound checks are safe. Safety of upper bound checks is more difficult to ascertain. Still, the results for the array-intensive benchmarks (the rightmost five) are encouraging; these are the benchmarks which will benefit the most, and also those in which we achieve the best results.

By analyzing the fields holding constant-length array objects, the intraprocedural analysis obtains more information about field accesses. The success of this method, however, depends on the application: 'mpegaudio' and 'raytrace' improve greatly, while the others are more or less unaffected (Figure 3(b)). Rectangular array analysis also proves to be very application-dependent: it benefits only those benchmarks using multidimensional arrays. LU, SOR, LCS, and MCO improve dramatically with the addition of this analysis. The last experiment shows the result of the combined use of the field and rectangular analyses. Because these are essentially independent analyses, the combined improvement is close to the sum of the improvements seen individually. With most of our benchmarks this brings the percentage of checks we can eliminate to 50% or more; again, the array-intensive benchmarks fare best, and in some cases we identify almost 100% of array bounds checks as safe.

Relative runtime performance improvements for the instrumented versions of the Kaffe JIT and HPCJ are given in Figure 4. Both systems were modified to read the array attribute information stored in the class file and to apply it during code generation. If array bounds checks are required, a test-and-branch code sequence is inserted prior to the array access. Note that a
[Figure 4 consists of three bar charts of speedup for mpegaudio, FFT, LU, SOR, LCS, and MCO, comparing "No checks" (all bounds checks disabled) against "With attributes":]
(a) Kaffe (y-axis -5% to 25%)
(b) HPCJ, other optimizations off (y-axis 0% to 60%); base times: mpegaudio 51s, FFT 25s, LU 29s, SOR 24s, LCS 87s, MCO 38s
(c) HPCJ, other optimizations on (y-axis -20% to 60%); base times: mpegaudio 17s, FFT 22s, LU 12s, SOR 52s, LCS 17s, MCO 21s
Fig. 4. Speedup for Kaffe and HPCJ
well-known optimization for bounds checking involves making use of the 2's-complement representation of integer values to perform just a single unsigned comparison that encompasses both the upper and lower bound checks (see [6]:144); the data here thus represents the use of attribute information only when both bounds are declared safe.
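In Java terms, that single-comparison idiom looks as follows (the cited code generators emit the machine-level equivalent):

    // Both bounds in one unsigned comparison: a negative index, viewed as an
    // unsigned value, is always >= 2^31 and therefore never below length.
    static boolean inBounds(int i, int length) {
        return Integer.compareUnsigned(i, length) < 0;   // same as (0 <= i && i < length)
    }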
If an array access is deemed safe from the attribute information, no such checks are created. This is done during actual (just-in-time) code generation for Kaffe, and at an internal, intermediate stage for HPCJ. In the latter case, this eliminates the potential array bounds exception that may restrict subsequent internal optimizations, resulting in different code output; for this reason we present results with and without HPCJ's own optimizations applied. Finally, note that every array access is an object access, and so null pointer checks are also required at these points. Depending on the machine architecture and how objects are organized, this check can be combined with the array bounds check, so removing the latter may require inserting explicit null pointer checks [9]. The best performance results therefore occur when both kinds of checks are eliminated; our results include this optimization.

The Kaffe results were gathered on a dual Pentium II, 400MHz, with 384M of memory, running Linux kernel 2.2.8 and glibc-2.1.3; the HPCJ results are from a Pentium III, 500MHz, with 192M of memory, running Windows NT. In each case the result of using the intraprocedural analysis combined with both the field and rectangular analyses is compared with the effect of artificially disabling all bounds checks. A couple of cases (LU in Kaffe; LU and FFT in HPCJ with optimization) exhibit anomalous results that we have been able to attribute to code cache effects. In all other cases, however, we achieve significant performance increases, roughly corresponding to the quality of the information we were able to collect.
6 Related Work
Array bounds check optimization has long been performed for other languages such as Pascal, Fortran, and Ada [8]. The problem of the runtime overhead of array bounds checks was first addressed in [7]; R. Gupta [4,5] extended that work by using data-flow analysis to move checks out of loops. These algorithms were designed for languages that do not require precise exceptions, which allows an exception to be thrown earlier than its original program point. More recently, Bodik et al. [1] presented an algorithm called ABCD (Eliminating Array Bounds Checks on Demand) for general Java applications. The algorithm uses a different form of constraint graph to resolve bounds checks. It builds an extended SSA (e-SSA) form for a method body; the e-SSA form guarantees that all uses of a variable (by name) are bounded by the same constraints on its runtime value range. Based on this form, a constraint graph is constructed in which nodes are locals and constants, and weighted edges are constraints representing inequality relationships between the nodes (a small sketch of such a graph appears after the list below). The relationship between an array and an index is then inferred by a customized depth-first search. VCA is similar to this approach in that both use inequality graphs to represent constraints. However, there are several differences between our algorithm and the ABCD approach:
1. The ABCD algorithm is based on an extended SSA form and uses one graph to summarize the constraints from all statements in a method; control-flow information is thus included in the constraint graph. Our VCA approach does not rely on any underlying program representation; it uses a fixed number of small, program-point-specific constraint graphs.
2. Based on the e-SSA form, the ABCD algorithm can be used in a demand-driven manner. Each demand (query) is solved individually, and may be performed only on selected array references that occur in hot spots. Although each query is relatively expensive, ABCD has an overall speed advantage over VCA. The VCA approach is designed to analyze all array references at once, and is intended for off-line use. Our experimental results show that our techniques for reducing the size of the graphs and the number of iterations work well to keep the cost of VCA reasonable.
3. The VCA approach keeps the constraints on lower and upper bounds in the same graph, which is not the case in the ABCD approach.
4. In our algorithm the constraint graph serves as the basis of the two other analyses, and for certain types of applications the impact of these analyses can be significant. It is currently unclear how information about class fields and multidimensional arrays could be used to help the ABCD algorithm.
5. ABCD is capable of catching partially redundant bounds checks; VCA currently is not.
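To make the weighted-edge encoding concrete, here is a minimal Java sketch of a difference-constraint graph of the general kind both approaches build; the class and its names are our own simplification, not either paper's actual representation:

```java
import java.util.*;

// A minimal difference-constraint graph: an edge u -> v with weight w
// records the inequality v - u <= w. For example, "i < a.length" adds
// an edge from the node for a.length to the node for i with weight -1,
// i.e., i - a.length <= -1.
class ConstraintGraph {
    // adjacency map: node -> (successor -> tightest known edge weight)
    private final Map<String, Map<String, Integer>> edges = new HashMap<>();

    void addConstraint(String u, String v, int w) {
        edges.computeIfAbsent(u, k -> new HashMap<>())
             .merge(v, w, Integer::min);          // keep the tightest bound
    }

    // Does the graph imply "v - u <= limit"? Any path u ~> v whose
    // weights sum to at most 'limit' is a proof; found here by a plain
    // depth-first search (cycles are simply cut, a simplification of
    // ABCD's handling of loops).
    boolean implies(String u, String v, int limit) {
        return dfs(u, v, 0, limit, new HashSet<>());
    }

    private boolean dfs(String cur, String target, int sum, int limit,
                        Set<String> onPath) {
        if (cur.equals(target)) return sum <= limit;
        if (!onPath.add(cur)) return false;       // don't revisit on this path
        for (Map.Entry<String, Integer> e :
                 edges.getOrDefault(cur, Map.of()).entrySet())
            if (dfs(e.getKey(), target, sum + e.getValue(), limit, onPath))
                return true;
        onPath.remove(cur);
        return false;
    }
}
```

In these terms, a check on a[i] is removable when the graph implies both that i is at least zero and that i - a.length <= -1; ABCD issues one such query per check on demand, whereas VCA computes the graphs for all array accesses at once.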
VCA's primary advantage is in its interaction and integration with other analyses. In isolation, VCA recognizes nearly the same percentage of safe upper bounds on the SPEC JVM98 benchmarks as reported in [1]. However, when combined with array field analysis and rectangular array analysis, VCA can outperform ABCD significantly; experiments show that VCA with rectangular array analysis is very effective on micro-benchmarks using two-dimensional arrays. In addition, we have provided complete experimental results showing runtime speedups. We also believe that the approach of formulating a problem as constraint graphs and solving it using data-flow analysis can be useful for other problems.

R. Shaham et al. [10,11] described an algorithm for identifying live regions of arrays in order to detect array memory leaks in Java. Although developed in a very different experimental setting, their representation and analysis are very similar to VCA: in both cases constraint graphs and data-flow analyses are used to compute inequalities between variables. However, their focus is on finding relationships between special class fields across method boundaries, based on supergraphs of a few particular library classes. Although such a supergraph could make our field analysis more powerful, our VCA approach focuses on intraprocedural analysis for general Java applications, and we handle the various kinds of statements in more detail.

Another important aspect of our VCA approach is that we use several techniques to reduce the cost of the data-flow analysis, such as limiting the number of constraint graph nodes and iterating in pseudo-topological order. Compared with other algorithms, VCA works at the bytecode level and does not change the program: the analysis results are encoded in class file attributes, so there are no problems with precise exception semantics, and information from various sources can be preserved. Although VCA uses a relatively sophisticated abstraction for the data-flow analysis, the techniques used in the algorithm reduce the overhead to a minimum, and VCA can easily be extended to take advantage of results from other analyses. We demonstrated how the two extended algorithms can improve the analysis results dramatically for array-intensive benchmarks. To target scientific programs, which use multidimensional arrays frequently, our rectangular array analysis provides very important information to VCA, helping the conservative VCA remove almost one hundred percent of the bounds checks in some typical applications. To the best of our knowledge, very little other work takes advantage of knowing array shapes. Further, we believe that array shape information can also help with the memory layout of array objects in a virtual machine [2].
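One of those cost-reduction techniques, iterating in pseudo-topological order, can be sketched as a generic worklist skeleton; the Block interface and the transfer predicate below are hypothetical stand-ins of our own construction, not the Soot implementation:

```java
import java.util.*;

// Sketch: drive a forward data-flow analysis over basic blocks in
// reverse postorder (one common realization of a pseudo-topological
// order), so most predecessors are processed before each block and
// the number of re-iterations stays small.
class Worklist {
    interface Block { List<Block> successors(); }

    // rpoIndex: precomputed reverse-postorder position of each block;
    // transfer: applies the flow function, returning true if the
    // block's out-state changed.
    static void solve(List<Block> blocksInRpo,
                      Map<Block, Integer> rpoIndex,
                      java.util.function.Predicate<Block> transfer) {
        PriorityQueue<Block> work =
            new PriorityQueue<>(Comparator.comparing(rpoIndex::get));
        Set<Block> queued = new HashSet<>(blocksInRpo);
        work.addAll(blocksInRpo);
        while (!work.isEmpty()) {
            Block b = work.poll();
            queued.remove(b);
            if (transfer.test(b))                 // out-state changed:
                for (Block s : b.successors())    // revisit successors
                    if (queued.add(s)) work.add(s);
        }
    }
}
```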
7 Conclusions
In this paper we have presented a collection of techniques for eliminating array bounds checks in Java. Our base analysis, variable constraint analysis (VCA), is a flow-sensitive intraprocedural analysis that approximates the constraints between important program variables at the program points corresponding to array access statements. The analysis has been made efficient by reducing the size of the graphs, choosing an appropriate worklist order, and applying a widening at loop entry points. As shown in the experimental results, the graphs are small (around 10 nodes for our benchmarks), and the average number of iterations per basic block is always less than 3.

To improve the precision of the base VCA analysis, we have described two additional techniques. Array field analysis is applied to each class to find those array-type fields that always hold an array with a fixed constant length (a small illustrative example appears below). Rectangular array analysis is applied to a whole program to find those variables that always refer to rectangular, non-ragged arrays. Given the information from these analyses, the intraprocedural VCA analysis was improved to include information about fields and about the upper dimensions of multi-dimensional arrays.

Our analyses were implemented in the Soot optimization/annotation framework, and we provided dynamic results showing the effectiveness of the base VCA analysis and the incremental improvements due to field and rectangular array analysis. These results were quite encouraging and demonstrated that almost all checks can be eliminated for benchmarks with very regular computations. We also provided experimental results for Kaffe and IBM's HPCJ to demonstrate that significant runtime savings can be achieved as a result of the analysis. Currently our attributes are not verifiable; we use them as an experimental tool to convey data-flow facts from our tool to the ahead-of-time or JIT compiler. Our next phase of work will be to integrate a side-effect analysis into the framework and to improve the information for arrays stored in objects. We would also welcome the opportunity to provide our attributed class files to other groups in order to see the runtime impact on other virtual machines.
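As a concrete, hypothetical illustration of what array field analysis looks for, consider a class whose array field is only ever assigned an array of one fixed constant length; the class and its names are our own example, not taken from the benchmarks:

```java
// The private field 'buf' is only ever assigned new int[64], so its
// length is a known constant at every read, and an access buf[i] with
// a proven 0 <= i < 64 needs no runtime bounds check.
class FixedBuffer {
    private final int[] buf = new int[64];

    int sum() {
        int s = 0;
        for (int i = 0; i < 64; i++)  // i provably within [0, 63]
            s += buf[i];
        return s;
    }
}
```

Since every load of buf sees an array of length 64, the analysis can treat buf.length as the constant 64 when VCA proves accesses safe.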
References

1. R. Bodik, R. Gupta, and V. Sarkar. ABCD: Eliminating array bounds checks on demand. In Proceedings of the ACM SIGPLAN '00 Conference on Programming Language Design and Implementation (PLDI), pages 321–333, Vancouver, BC, Canada, June 2000.
2. M. Cierniak and W. Li. Optimizing Java bytecodes. Concurrency: Practice and Experience, 9(6):427–444, 1997.
3. T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. McGraw-Hill and MIT Press, 1990.
4. R. Gupta. A fresh look at optimizing array bound checking. In Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, pages 272–282, White Plains, NY, June 1990.
5. R. Gupta. Optimizing array bound checks using flow analysis. ACM Letters on Programming Languages and Systems, 2(1-4):135–150, 1993.
6. S. Hoxey, F. Karim, B. Hay, and H. Warren, editors. The PowerPC Compiler Writer's Guide. IBM Microelectronics Division, 1996.
7. V. Markstein, J. Cocke, and P. Markstein. Optimization of range checking. In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, pages 114–119, June 1982.
8. S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
9. P. Pominville, F. Qian, R. Vallée-Rai, L. Hendren, and C. Verbrugge. A framework for optimizing Java using attributes. In Proceedings of Compiler Construction 2001, pages 334–354, 2001.
10. R. Shaham. Automatic removal of array memory leaks in Java. Master's thesis, Tel-Aviv University, Tel-Aviv, Israel, September 1999. Available at http://www.math.tau.ac.il/~rans/thesis.zip.
11. R. Shaham, E. K. Kolodner, and M. Sagiv. Automatic removal of array memory leaks in Java. In D. A. Watt, editor, Compiler Construction, 9th International Conference, volume 1781 of Lecture Notes in Computer Science, pages 50–66, Berlin, Germany, March 2000. Springer.
Author Index
Agrawal, Gagan 29
Altman, Erik 95
Amarasinghe, Saman 179
Avdičaušević, Enis 1
Backhouse, Kevin 128
Brand, Mark G. J. van den 143
Cilio, Andrea G. M. 247
Corporaal, Henk 247
Cousot, Patrick 159
Cousot, Radhia 159
Ertl, M. Anton 5
Fang, Jesse 307
Fedoseev, Stanislav A. 293
Garavel, Hubert 9
Gregg, David 5
Gschwind, Michael 95
Gupta, Rajiv 14, 62, 261
Henderson, Fergus 197
Hendren, Laurie 111, 325
Kadayif, Ismail 276
Kandemir, Mahmut 276
Karczmarek, Michal 179
Kolcu, Ibrahim 276
Krishnaiyer, Rakesh 307
Kwiatkowski, Paul 128
Lang, Frédéric 9
Lenič, Mitja 1
Li, Jinqian 29
Li, Wei 307
Lipsky, Nikita V. 293
Mateescu, Radu 9
Matias, Yossi 78
McPeak, Scott 213
Mehofer, Eduard 62
Mernik, Marjan 1
Miecznikowski, Jerome 111
Mikheev, Vitaly V. 293
Mössenböck, Hanspeter 229
Mohnen, Markus 46
Moor, Oege de 128
Necula, George C. 213
Onder, Soner 261
Oren, David 78
Pande, Santosh 261
Pfeiffer, Michael 229
Qian, Feng 325
Rahul, Shree P. 213
Rele, Siddharth 261
Sagiv, Mooly 78
Scheerder, Jeroen 143
Serrano, Mauricio 307
Somogyi, Zoltan 197
Su, Qi 29
Sukharev, Vladimir V. 293
Thies, William 179
Verbrugge, Clark 325
Vinju, Jurgen J. 143
Visser, Eelco 143
Weimer, Westley 213
Wu, Youfeng 307
Wyk, Eric Van 128
Zhang, Youtao 14, 62
Žumer, Viljem 1