    parBuffer :: Int -> Strategy a -> Strategy [a]
    parBuffer n s = evalBuffer n (rpar `dot` s)

5.2 Clustering

When tuning the performance of parallel programs it is often important to increase the size of parallel computations, i.e. to use a coarser granularity, in order to achieve a better ratio of computation versus coordination costs. Implementations often contain mechanisms to automatically use coarser granularity on loaded processors. The scenario of fizzling sparks discussed in Section 3.1 is such an example, because the work of a spark is performed by an already running computation. However, further improvements can be obtained by explicitly controlling thread granularity, and in the context of the original strategies we developed a range of clustering techniques [Loidl et al. 2001]. This section adapts these techniques for the new strategies and extends them.

One way to obtain a coarser granularity is to collect computations on related elements of a data structure into "clusters." To this end, we define a class Cluster containing cluster and decluster methods, as well as a method lift that turns an operation over the original data structure into one over such a clustered data structure.

    class (Traversable c, Monoid a) => Cluster a c where
      cluster   :: Int -> a -> c a
      decluster :: c a -> a
      lift      :: (a -> b) -> c a -> c b

      lift      = fmap  -- c is a Functor, via Traversable
      decluster = fold  -- c is Foldable, via Traversable
      -- we require: decluster . cluster n == id

By assuming the Traversable and Monoid contexts we get several operations for free. Through the implicit Functor context, we can use fmap to lift an operation over the base type to one in the cluster type. And through the Monoid and Foldable contexts (the latter implicit), we can use fold as the default for decluster, provided it is an inverse of cluster.

As an example we provide an instance for lists, clustered into lists of lists. Notably, we only have to provide a definition for the cluster method.

    instance Cluster [a] [] where
      cluster _ [] = []
      cluster n xs = ys : cluster n zs
        where (ys,zs) = splitAt n xs

We aim to define a strategy combinator

    evalCluster :: Cluster a c => Int -> Strategy a -> Strategy a

However, the cluster type c occurs only in the constraint, so callers would have no way to determine it; we therefore add a proxy argument of type w c whose only purpose is to fix c:

    evalCluster :: forall a c w . Cluster a c
                => w c -> Int -> Strategy a -> Strategy a
    evalCluster _ n s x = return (decluster (cluster n x `using` cs))
      where cs = evalTraversable s :: Strategy (c a)

Thanks to the Traversable context (inherited from Cluster), we can lift the strategy s to a strategy cs which is applicable to the clustered input. Note that the type annotation in the where clause necessitates the explicit forall in the signature.

With this infrastructure we can define a generic parMapCluster, a variant of parMap performing implicit clustering (based on the Cluster class) behind the scenes.

    parMapCluster :: forall a b c w . Cluster [b] c
                  => w c -> Int -> Strategy b
                  -> (a -> b) -> [a] -> [b]
    parMapCluster _ n s f xs =
      map f xs `using` evalCluster (__ :: w c) n (rpar `dot` evalList s)

Observe how a type annotation is used to emulate passing the (wrapped) cluster type c as a "type argument" to evalCluster; the double underscore __ is short for the bottom value undefined.

To improve readability, instead of wrapping the type argument with fresh type variables, we can use a properly named phantom type:

    data ClusterWith :: (* -> *) -> *

Now it is intuitive that parMapCluster (__ :: ClusterWith []) uses lists for clustering.
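As a usage sketch (ours, not from the paper, assuming the definitions above are in scope): mapping a function over a long list with a hypothetical cluster size of 500, so each spark evaluates a whole sub-list rather than a single element.

    -- Hedged example; the cluster size 500 and the workload are
    -- illustrative assumptions, not values from the paper.
    squares :: [Int]
    squares = parMapCluster (__ :: ClusterWith []) 500 rdeepseq (^ 2) [1 .. 100000]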
5.3 A Divide-and-conquer Pattern

One of the main strengths of strategies is the possibility of constructing abstractions over patterns of parallel computation. Thereby all code specifying the coordination of the program is confined to the pattern. Concrete applications can then instantiate the function parameters to get parallel execution for free. Such patterns are commonly known as algorithmic skeletons [Cole 1989].

As an example we give the implementation of a divide-and-conquer pattern. It is parametrised by a function that specifies the operation to be applied on atomic arguments (f), a function to (potentially) divide the argument into two smaller values (divide), and a function to combine the results from the recursive calls (conquer). Additionally, we provide a function threshold that is used to limit the amount of parallelism, by using a sequential strategy for arguments below the threshold.

    divConq :: (a -> b)            -- compute the result
            -> a                   -- the value
            -> (a -> Bool)         -- par threshold reached?
            -> (b -> b -> b)       -- combine results
            -> (a -> Maybe (a,a))  -- divide
            -> b
    divConq f arg threshold conquer divide = go arg
      where
        go arg =
          case divide arg of
            Nothing      -> f arg
            Just (l0,r0) -> conquer l1 r1 `using` strat
              where
                l1 = go l0
                r1 = go r0
                strat x = do r l1; r r1; return x
                  where r | threshold arg = rseq
                          | otherwise     = rpar

All coordination aspects of the function are encoded in the strategy strat, which describes how the two subcomputations l1 and r1 should be evaluated. The thresholding predicate threshold provided by the caller places a bound on the depth of parallelism, and this is used by strat to decide whether to spark both l1 and r1 or to evaluate them directly. The definition of divConq achieves separation between the specifications of algorithm and parallelism, the latter being confined entirely to the definition of strat.
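For illustration, a hedged instantiation of the pattern (ours, not from the paper): counting the leaves of the naive Fibonacci call tree, sparking subtrees only for arguments at or above an illustrative threshold of 20.

    -- Hedged example of instantiating divConq. Arguments below 20
    -- are evaluated sequentially via rseq; larger ones are sparked.
    fibLeaves :: Int -> Integer
    fibLeaves n =
      divConq (const 1)                 -- f: each leaf counts 1
              n                         -- the value
              (< 20)                    -- threshold: below 20, go sequential
              (+)                       -- conquer: add the two subtree counts
              (\k -> if k < 2
                       then Nothing     -- atomic argument: stop dividing
                       else Just (k - 1, k - 2))

The result is the number of leaves of the call tree; all parallel coordination comes from strat inside divConq, leaving the arguments purely algorithmic.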
5.4 Improving Safety

The original strategy type a -> () embodies the key modularity goal of separating computation and coordination. As any original strategy can only ever return (), it can never change the result of a computation, up to divergence. Unfortunately, the new strategy type gives up this type safety. Strategies of the new type a -> Eval a should be identity functions, i.e. only evaluate their argument but never change its value; we term this property identity safety. However, the type system cannot enforce this behaviour, and it is all too easy to accidentally write flawed strategies, for instance:

    x:xs `using` \ _ -> parList rdeepseq xs

The intention of the programmer is to evaluate the tail of the list in parallel when the list is demanded. The strategy will do that, but it then returns only the tail of the list. We propose a way of trading expressiveness for type-checked identity safety. For this purpose, the module SafeStrategies (not currently distributed with the parallel package) clones the functionality and interface of Strategies, except for wrapping the strategy type with a newtype and providing an explicit strategy application operator.

    newtype Strategy a = MkStrat (a -> Eval a)

    ($$) :: Strategy a -> a -> Eval a
    (MkStrat strat) $$ x = strat x

By hiding the constructor MkStrat when importing the module SafeStrategies, programmers can choose to treat Strategy as an abstract type, thereby restricting themselves to use only strategies constructed by the predefined and trusted (identity safe) strategy combinators (like evalList and evalTraversable). Since MkStrat is not available, the type checker will prevent programmers from "hand-rolling" their own strategies (e.g. the flawed strategy above will be rejected), thereby eliminating the danger of accidentally violating identity safety.

Yet, programmers can still use the Eval monad freely. For instance, the non-modular example of task parallelism from Section 4.6 can be ported to SafeStrategies by inserting $$ after rpar and rseq. Of course, careless use of rpar may cause space leaks or lost parallelism, depending on the GC policy (Section 3), but that is a lesser concern than identity safety because it does not compromise functional correctness.

Why does SafeStrategies export the constructor MkStrat at all, rather than making Strategy a proper abstract type? The reason is that programmers who do need to "hand-roll" their own strategies may want to wrap them in MkStrat. Thus, MkStrat marks the pieces of code where programmers incur "proof obligations" to establish identity safety.
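A hedged sketch of that porting step (our example; f and g are placeholder functions, and rpar and rseq here are the SafeStrategies versions of type Strategy a):

    -- Task parallelism in the Eval monad with explicit strategy
    -- application via $$, as described above.
    parPair :: (a -> b) -> (a -> c) -> a -> (b, c)
    parPair f g x = runEval $ do
      b <- rpar $$ f x   -- spark the first computation
      c <- rseq $$ g x   -- evaluate the second to WHNF
      return (b, c)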
6. Evaluation

This section discusses our measurements in detail, but first we summarise the key results:

• For all programs, the speedup and runtime results with original and new strategies are very similar, giving us confidence that they specify the same parallel coordination for a range of programs and parallel paradigms (Figure 2).

• The speedups achieved with the new strategies are slightly better compared to those with the original strategies: a mean of 4.96 versus 4.83 on 7 cores across all applications (columns 3 & 2 of Table 2).

• The new strategies fix the space leak outlined in Section 3, reducing heap residency on a single core by 56.43% across all applications, and better support speculative parallelism (Section 6.4).

• The overheads of the new strategies are low: mean sequential run-time overhead is 3.84% (Table 1), and memory overheads are low for most programs (columns 8–11 of Table 2).

6.1 Apparatus

Our measurements are made on an eight-core HP XW6600 Workstation with 8GB RAM, comprising two Intel Xeon 5410 quad-core processors, each running at 2.33GHz with a 6MB L2 cache. The benchmarks run under Linux Fedora 7 using a recent GHC development snapshot (6.13 as of 20.5.2010), and parallel packages 1.1.0.1 and 3.1.0.0 for the original and new strategies, respectively. The data points reported are the median of 3 executions. We measure up to 7 cores, as measurements on the eighth core are known to introduce some variability.

Our benchmarks are 10 parallel applications from a range of application areas; 8 of these have been taken from existing benchmark suites [Aswad et al. 2009; Loidl et al. 1999; Marlow et al. 2009] and 2 benchmarks, Coins and TransClos, have been developed afresh with the new strategies module. The programs are the computational kernels of realistic applications, cover a variety of parallel paradigms, and employ several important parallel programming techniques, such as thresholding to limit the amount of parallelism generated, and clustering to obtain coarser thread granularity.

Genetic aligns RNA sequences, using divide-and-conquer parallelism and nested data parallelism. MiniMax performs an alpha-beta search in a 2-player game tree, using divide-and-conquer parallelism and exploiting laziness to prune unnecessary subtrees. Queens solves the n-queens problem, using divide-and-conquer parallelism with an explicit threshold. LinSolv finds an exact solution to a set of linear equations, employing the data parallel multiple homomorphic images approach. Hidden performs hidden-line removal in 3D rendering and uses data parallelism via the parList strategy. Maze searches for a path in a 2D maze and uses speculative data parallelism. Sphere is a ray-tracer from the Haskell nofib suite, using nested data parallelism implemented as parMaps. TransClos finds all elements that are reachable via a given relation from a given set of seed values, using a parBuffer strategy over an infinite list. Coins computes the number of ways of paying a given value from a given set of coins, using divide-and-conquer parallelism. MatMult performs matrix multiplication using data parallelism with implicit clustering.
6.2 Sequential Overhead

Table 1 shows the sequential runtime as baseline, and the difference of the 1-processor runtime with both original and new strategies. For the new strategies, we encounter a runtime overhead of at most +18.14%, for the MatMult program, which is mainly due to the additional work in performing clustering. For all other programs the strategy overhead is significantly lower. Notably, the data parallel programs have a fairly low overhead, despite the additional traversal of a data structure to expose parallelism. Comparing the geometric means of the runtime overheads imposed by both strategy versions, we encounter only a slightly higher overhead for the new strategies: +3.84% compared to +3.21% with the original strategies. This justifies the new strategy approach of high-level generic abstractions.

                 Seq. Runtime        ∆ Time (%)
    Program        (seconds)    Original       New   Paradigm
    LinSolv           23.40        +0.90     +1.97   Nested Data par
    TransClos         83.12        +0.77     +2.24   Data par
    Sphere            21.11        +4.78     +3.32   Nested Data par
    MiniMax           36.98        +0.87     +3.22   D&C
    Coins             42.49        +1.11     +2.12   D&C
    Queens            25.51        +1.37     +6.12   D&C
    MatMult           18.85       +16.87    +18.14   Data par
    Genetic           33.46        +2.96     +3.97   D&C Data par
    Hidden             4.61        +5.86     +2.17   Data par
    Maze              40.93        -2.22     -3.59   Nested Data par
    Geom. Mean                     +3.21     +3.84

    Table 1. Sequential runtime overheads.
6.3 Parallel Performance

Speedups: Figure 2 compares the absolute speedup curves (i.e. speedup relative to sequential runtime) for the applications with the original and new strategies. Both the runtime curves (not reported here) and the speedup curves for the original and new strategies are very similar. The pattern is repeated in more detailed analysis, e.g. in columns 2 and 3 of Table 2. We conclude that the original and new strategies specify the same parallel coordination for a variety of programs representing a range of parallel paradigms and several tuning techniques.

Performance: Table 2 analyses in detail the speedups, number of sparks and memory consumption of all applications, running on 7 cores of an 8-core machine with the original strategies and the new strategies, always using a ROOT garbage collection policy. The number of generated sparks was in all cases virtually identical between original and new strategies, giving us further confidence that the two formulations express the same parallelism. Small differences in the number of generated sparks arise because GHC has a non-deterministic execution model in which a particular expression may be evaluated multiple times at runtime [Harris et al. 2005].

In the cases where the new strategies exhibit poorer performance, the reduction in speedup is still very small: from 5.67 to 5.48 in the worst case, for MiniMax. This reflects the low overhead associated with the new strategies, quantified in the previous subsection.

Interestingly, the performance of the new strategies in the Queens and Sphere programs is better than with the original strategies. Examining the heap consumption reveals that with the new strategies the heap residency is significantly reduced: −24.11% for Queens and −14.53% for Sphere. This results in a lower total garbage collection time, which contributes about half of the reduction in runtime. The reduction in residency is accounted for by the improved space behaviour of the new strategies: the space retained by superfluous sparks is being reclaimed.

Granularity improvement: The comparison of generated versus converted sparks in Table 2 demonstrates the runtime system's effective handling of potential parallelism (sparks). Even when an excessive number of sparks is generated, for example in Coins, the runtime system converts only a small fraction of these sparks. As with any divide-and-conquer program, a thread generated for a computation close to the root will itself evaluate potential child computations, causing their corresponding sparks to fizzle. Hence the granularity of the generated sparks is automatically made coarser, reducing overheads, as can be seen from the speedups achieved. In general, the new strategies provide more opportunities for sparks to fizzle, as discussed in Section 3. This shows up in a lower number of converted sparks for all divide-and-conquer and nested data parallel programs. For single-level data parallelism, as in Sphere, where sparks never share graph structures, there is little or no reduction in the number of converted sparks.

6.4 Memory Management

Fixing the space leak: The new strategies fix the space leak outlined in Section 3. For example, for the parallel raytracer that exhibited the space leak (http://hackage.haskell.org/trac/ghc/ticket/2185) with the original strategies, the heap residency drops from 1.6GB to 5.8MB with the new strategies on 1 core, and the runtime correspondingly drops by about a factor of 3. Comparing single-core executions for all benchmark programs shows a mean reduction in residency of 56.43% with the new strategies. However, for multiple cores the heap measurements in Table 2 do not show a consistent reduction in residency for the new strategies. A number of factors contribute to the observed behaviour here:

• With parallel processors available, garbage sparks tend to be evaluated by other cores and hence fizzle, avoiding the space leak (but wasting some cycles).

• Parallel evaluation itself tends to change the residency profile, in most cases increasing the residency compared to sequential execution.

• Residency is recorded by sampling, and hence the measured value is noisy.

In the case of MatMult, heap residency roughly doubles with the new strategies. This is due to the new strategies composing the clusters to return the result value. The original strategies only use the clusters to express parallelism, but do not compose them into the final result. Despite the higher residency, however, the new strategies achieve a better speedup.

Speculation: To assess the effectiveness of the garbage collection policies ROOT and WEAK, described in Section 3, for managing speculation, we use a program that applies drop to a parallelised list computing the number of primes up to a given value, thereby rendering the sparks on the dropped list elements speculative:

    sum $ drop ((m1-m0) `quot` 2) $
      ([ length (primes n) | n <- [m0..m1] ] `using` parList rdeepseq)

With the WEAK policy almost all sparks of the original strategies are discarded, as expected. With the new strategies 3404 out of 10001 sparks are converted, 32% fewer than with the ROOT policy, although this value varies considerably between executions. Most importantly, the WEAK policy prunes 4796 sparks, almost all of the 5000 speculative sparks. In contrast, the ROOT policy prunes only 3193 sparks, all of them due to fizzling.

Only two of our application kernels use speculation: MiniMax and Maze. In the case of MiniMax the WEAK policy significantly reduces the variation of residencies over the number of cores, and in a 7-core execution residency is reduced by 83.4%. In the case of Maze residency remains unchanged. In both cases the speedup improves only slightly when employing a WEAK policy. Of course, the very inability to reclaim speculative sparks with the ROOT policy discouraged any applications from using them on a larger scale.
Figure 2. Speedup graphs of the application kernels with original and new strategies. (Two panels, original and new strategies, each plotting speedup against number of cores, 1-7, for LinSolv, TransClos, Sphere, MiniMax, Coins, Queens, MatMult, Genetic, Hidden and Maze.)
                 Speedup     Generated Sparks   Converted Sparks    Allocated Heap      Maximum Residency
    Program     Orig   New     Orig      New      Orig     New    Orig (MB)   New ∆%   Orig (KB)   New ∆%
    LinSolv     6.59  6.44      7562     7562      7562    7562     6050.10    +0.15     7104.70    +3.87
    TransClos   6.04  5.81      1041     1041      1041    1040    80174.60    +0.07      108.60    +1.47
    Sphere      4.95  5.67       160      160       160     160     8636.40    -1.14   120943.30   -14.53
    MiniMax     5.67  5.48      1464     1464      1464     163    30476.85    -0.01       98.05    -7.17
    Coins       5.61  5.53    145925   146853      2702    1060    79833.20    +1.59      302.10   +20.36
    Queens      4.58  5.49      1589     1563      1589     636    14903.30   -17.52    19134.50   -24.11
    MatMult     5.04  5.39       100      100       100     100      109.00    -6.97    12272.80  +102.04
    Genetic     4.95  5.02       659      674       659     166    12180.20    -6.75      493.90    +7.88
    Hidden      4.66  4.66       324      324       324     324     4805.50    -0.01     2349.80    -0.44
    Maze        2.05  2.01      2723     2835      2525     481   194122.00    +7.74       71.20   -33.15
    Geom. Mean  4.83  4.96                                                     -2.51               +1.03

    Table 2. Speedups, number of sparks and heap consumption on 7 cores.
7. Related Work

Parallel functional languages [Hammond and Michaelson 1999] typically embed high-level coordination languages into high-level computation languages. A range of high-level coordination models have been used [Trinder et al. 2002], and this section relates the semi-explicit approach adopted by evaluation strategies to other approaches.

Entirely implicit coordination aims to minimise programmer input, typically using either profiling as in [Harris and Singh 2007] or parallel iteration as in [Grelck and Scholz 2003]. Few entirely implicit approaches other than parallel iteration have delivered acceptable performance [Nikhil and Arvind 2001]. Evaluation strategies provide more general parallel coordination than loop parallelism.

Skeleton-based coordination, for instance [Loogen et al. 2005] or [Scaife et al. 2005], is popular in both imperative and functional languages, and exploits a small set of predefined skeletons. Each skeleton is a polymorphic higher-order function describing a common coordination pattern with an efficient parallel implementation [Cole 1989]. As polymorphic higher-order functions, evaluation strategies are similar to skeletons, but there are some key differences. Rather than offering a fixed set of skeletons, evaluation strategies are readily combined to form new strategies. Moreover, where skeletons are parametrised with computational arguments, a strategy is typically applied to a computation.

Data parallel coordination, as in [Blelloch 1996] or as implemented in Data Parallel Haskell [Chakravarty et al. 2007], supports the parallel evaluation of every element in a collection. This is a good match for Haskell's powerful constructs for bulk data types, in particular lists. Data parallelism is often more implicit than evaluation strategies: the programmer simply identifies the collections to be evaluated in parallel. Strategies are more general in that they can express both control parallelism and data parallelism, although in terms of performance Data Parallel Haskell is designed to compile parallel programs down to highly optimised low-level loops over arrays, and hence should achieve significantly better absolute performance on data-parallel programs than would be possible using strategies.

Recent work by [Prabhu et al. 2010] has shown that parallelism by speculating on future data dependencies can be provided as a safe (correctness-preserving) abstraction to programmers. As one might expect, their approach translates naturally into Haskell using par. This approach to speculation is complementary to the speculative parallelism afforded by strategies.
8. Conclusion

The original strategies were developed in 1996 for Haskell 1.2, i.e. before monads, and using a compiler with relatively tame optimisations. The context for the new strategies is radically different. Monads, supported by rich libraries and syntactic sugar like do-notation, are now the preferred mechanism for sequencing computations, and are familiar to the rapidly growing Haskell user community. Applicative functors elegantly encode data structure traversals. Finally, the aggressive use of optimisations in mature Haskell implementations like GHC makes bespoke efficiency specialisations unnecessary.

The new strategy formulation capitalises on improved Haskell idioms and implementations to provide a modular and compositional notation for specifying pure deterministic parallelism. While it has some minor drawbacks (being relatively complex, providing relatively weak type safety, and requiring care to express control parallelism), the advantages are many and substantial. It provides clear, generic, and efficient specification of parallelism with low runtime overheads. It resolves a subtle space management issue associated with parallelism, better supports speculation, and is able to directly express parallelism embedded within lazy data structures. The new strategies are available as part of the Haskell parallel package (since version 3); additional code and benchmarks can be downloaded from http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/new-strategies.html. We plan to further enhance and formalise the identity safety of the new strategies, following the direction discussed in Section 5.4. Moreover, the genericity of the new strategies could be improved by automatically deriving instances of the NFData class.

Acknowledgments

Thanks to Greg Michaelson, Simon Peyton Jones and the anonymous referees for constructive feedback. This research is supported by the EPSRC HPC-GAP (EP/G05553X) and EU FP6 SCIEnce (RII3-CT-2005-026133) projects.

References

M. K. Aswad, P. W. Trinder, A. D. Al Zain, G. J. Michaelson, and J. Berthold. Low pain vs no pain multi-core Haskells. In TFP '09 — Draft Proc. of Symp. on Trends in Functional Programming, pages 112–130, Komarno, Slovakia, June 2009.

C. Baker-Finch, D. J. King, and P. Trinder. An operational semantics for parallel lazy evaluation. In ICFP '00 — Intl. Conf. on Functional Programming, pages 162–173, Montreal, Canada, Sept. 2000. ACM Press.

G. E. Blelloch. Programming parallel algorithms. Commun. ACM, 39(3):85–97, 1996.

M. M. T. Chakravarty, R. Leshchinskiy, S. Peyton Jones, G. Keller, and S. Marlow. Data parallel Haskell: a status report. In DAMP '07 — Workshop on Declarative Aspects of Multicore Programming, pages 10–18, Nice, France, Jan. 2007. ACM Press.

M. I. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press & Pitman, 1989.

C. Grelck and S.-B. Scholz. SAC — from high-level programming with arrays to efficient parallel execution. Parallel Processing Letters, 13(3):401–412, 2003.

K. Hammond and G. Michaelson, editors. Research Directions in Parallel Functional Programming. Springer, 1999.

T. Harris and S. Singh. Feedback directed implicit parallelism. In ICFP '07 — Intl. Conf. on Functional Programming, pages 251–264, Freiburg, Germany, Oct. 2007. ACM Press.

T. Harris, S. Marlow, and S. Peyton Jones. Haskell on a shared-memory multiprocessor. In Haskell '05 — Workshop on Haskell, pages 49–61, Tallinn, Estonia, Sept. 2005. ACM Press.

D. Jones Jr., S. Marlow, and S. Singh. Parallel performance tuning for Haskell. In Haskell '09 — Symposium on Haskell, pages 81–92, Edinburgh, Scotland, Sept. 2009. ACM Press.

H.-W. Loidl, P. W. Trinder, K. Hammond, S. B. Junaidu, R. G. Morgan, and S. L. Peyton Jones. Engineering parallel symbolic programs in GpH. Concurrency — Practice and Experience, 11(12):701–752, 1999.

H.-W. Loidl, P. W. Trinder, and C. Butz. Tuning task granularity and data locality of data parallel GpH programs. Parallel Processing Letters, 11(4):471–486, 2001.

R. Loogen, Y. Ortega-Mallén, and R. Peña-Marí. Parallel functional programming in Eden. J. Funct. Program., 15(3):431–475, 2005.

S. Marlow, S. Peyton Jones, and S. Singh. Runtime support for multicore Haskell. In ICFP '09 — Intl. Conf. on Functional Programming, pages 65–78, Edinburgh, Scotland, Sept. 2009. ACM Press.

C. McBride and R. Paterson. Applicative programming with effects. J. Funct. Program., 18(1):1–13, 2008.

E. Mohr, D. A. Kranz, and R. H. Halstead Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Trans. Parallel Distrib. Syst., 2(3):264–280, 1991.

R. S. Nikhil and Arvind. Implicit Parallel Programming in pH. Morgan Kaufmann Publishers, 2001.

P. Prabhu, G. Ramalingam, and K. Vaswani. Safe programmable speculative parallelism. In PLDI '10 — Conf. on Programming Language Design and Implementation, pages 50–61, Toronto, Ontario, Canada, June 2010. ACM Press.

N. Scaife, S. Horiguchi, G. Michaelson, and P. Bristow. A parallel SML compiler based on algorithmic skeletons. J. Funct. Program., 15(4):615–650, 2005.

P. W. Trinder, K. Hammond, H.-W. Loidl, and S. L. Peyton Jones. Algorithms + strategy = parallelism. J. Funct. Program., 8(1):23–60, 1998.

P. W. Trinder, H.-W. Loidl, and R. F. Pointon. Parallel and distributed Haskells. J. Funct. Program., 12(4&5):469–510, 2002. Special Issue on Haskell.
Scalable I/O Event Handling for GHC

Bryan O'Sullivan (Serpentine)    Johan Tibell (Google)

Abstract

We have developed a new, portable I/O event manager for the Glasgow Haskell Compiler (GHC) that scales to the needs of modern server applications. Our new code is transparently available to existing Haskell applications. Performance at lower concurrency levels is comparable with the existing implementation. We support millions of concurrent network connections, with millions of active timeouts, from a single multithreaded program, levels far beyond those achievable with the current I/O manager. In addition, we provide a public API to developers who need to create event-driven network applications.
Categories and Subject Descriptors D.3.2 [Programming Languages]: Language Classifications—Applicative (functional) languages; Concurrent, distributed and parallel languages; D.3.3 [Programming Languages]: Language Constructs and Features—Concurrent programming structures; D.3.4 [Programming Languages]: Processors—Runtime environments

General Terms Algorithms, Languages, Performance

1. Introduction

The concurrent computing model used by most Haskell programs has been largely stable for almost 15 years [10]. Despite the language's many innovations in other areas, networked software is written in Haskell using a programming model that will be familiar to most programmers: a thread of control synchronously sends and receives data over a network connection. By synchronous, we mean that when a thread attempts to send data over a network connection, its continued execution will be blocked if the data cannot immediately be either sent or buffered by the underlying operating system.

The Glasgow Haskell Compiler (GHC) provides an environment with a number of attractive features for the development of networked applications. It provides composable synchronization primitives that are easy to use [3]; lightweight threads; and multicore support [2]. However, the increasing demands of large-scale networked software have outstripped the capabilities of crucial components of GHC's runtime system. We have rewritten GHC's event and timeout handling subsystems to be dramatically more efficient. With our changes, a modestly configured server can easily cope with networking workloads that are several orders of magnitude more demanding than before. Our new code is designed to accommodate both the thread-based programming model of Concurrent Haskell (with no changes to existing application code) and the needs of event-driven applications.

2. Background

2.1 The GHC concurrent runtime

GHC provides a multicore runtime system that uses a small number of operating system (OS) threads to manage the execution of a potentially much larger number of lightweight Haskell threads [6]. The number of operating system threads to use may be chosen at program startup time, with typical values ranging up to the number of CPU cores available. (GHC also provides an "unthreaded" runtime, which does not support multiple CPU cores; we are concerned only with the threaded runtime.)

From the programmer's perspective, programming in Concurrent Haskell is appealing due to the simplicity of the synchronous model. The fact that Haskell threads are lightweight, and do not have a one-to-one mapping to OS threads, complicates the implementation of the runtime system. When a Haskell thread must block, this cannot lead to an OS-level thread also being blocked, so the runtime system uses a single OS-level I/O event manager thread (which is allowed to block) to provide an event notification mechanism. The standard Haskell file and network I/O libraries are written to cooperate with the I/O event manager thread. When one of these libraries acquires a resource such as a file or a network socket, it immediately tells the OS to access the resource in a non-blocking fashion. When a client attempts to access (e.g. read or write, send or receive) such a resource, the library performs the following actions:

1. Attempt to perform the operation. If it succeeds, resume immediately.

2. If the operation would need to block, the OS will instead cause it to fail and indicate (via EAGAIN or EWOULDBLOCK in Unix parlance) that it must be retried later.

3. The thread registers with the I/O event manager to be awoken when the operation can be completed without blocking. The sleeping and waking are performed using the lightweight MVar synchronization mechanism of Concurrent Haskell.

4. Once the I/O event manager wakes the thread, return to step 1. (The operation may fail repeatedly with a would-block error, e.g. due to a lost race against another thread for resources, or an OS buffer filling up.)

As this sketch indicates, GHC provides a synchronous programming model using a lower-level event-oriented mechanism. It does so via a semi-public API that clients (e.g. the file and networking libraries) can use to provide blocking semantics:
    -- Block the current thread until data is available
    -- on the given file descriptor.
    threadWaitRead, threadWaitWrite :: Fd -> IO ()
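The following is a hedged sketch (ours, not the libraries' actual code) of the retry loop from steps 1-4 above, wrapping a raw non-blocking read with threadWaitRead:

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Control.Concurrent (threadWaitRead)
    import Foreign.C.Error (eAGAIN, eWOULDBLOCK, getErrno, throwErrno)
    import Foreign.C.Types (CChar, CInt (..), CSize (..), CSsize (..))
    import Foreign.Ptr (Ptr)
    import System.Posix.Types (Fd (..))

    foreign import ccall unsafe "read"
      c_read :: CInt -> Ptr CChar -> CSize -> IO CSsize

    -- Step 1: try the operation. Step 2: the OS reports EAGAIN or
    -- EWOULDBLOCK. Step 3: sleep via the I/O event manager.
    -- Step 4: retry from the top.
    readWhenReady :: Fd -> Ptr CChar -> CSize -> IO CSsize
    readWhenReady fd@(Fd raw) buf len = go
      where
        go = do
          n <- c_read raw buf len
          if n >= 0
            then return n
            else do
              errno <- getErrno
              if errno == eAGAIN || errno == eWOULDBLOCK
                then threadWaitRead fd >> go
                else throwErrno "readWhenReady"  -- a genuine error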
2.2 Timeout management and robust networking

Well designed network applications make careful use of timeouts to provide robustness in the face of a number of challenges. At internet scale, broken and malicious clients are widespread. As an example, a defensively written application will, if a newly connected client doesn't send any data within a typically brief time window, unilaterally close the connection and clean up its resources. To support this style of programming, the System.Timeout module provides a timeout function:

    timeout :: Int -> IO a -> IO (Maybe a)

It initiates an IO action, and if the action completes within the specified time limit, returns Just its result; otherwise it aborts the action and returns Nothing. Concurrent Haskell also provides a threadDelay function that blocks the execution of a thread for a specified amount of time. Behind the scenes, the I/O event manager thread maintains a queue of pending timeouts. When a timeout fires, it wakes the appropriate application thread.
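For example (a hedged sketch; getLine stands in for reading from a client socket), dropping a client that sends nothing within 30 seconds:

    import System.Timeout (timeout)

    main :: IO ()
    main = do
      -- timeout takes microseconds; 30 * 1000000 = 30 seconds.
      mline <- timeout (30 * 1000000) getLine
      case mline of
        Nothing   -> putStrLn "client too slow; closing connection"
        Just line -> putStrLn ("received: " ++ line)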
3. Related work

Li and Zdancewic [9] began the push for higher concurrency in Haskell server applications with an application-level library that provides both event- and thread-based interfaces. We followed their lead in supporting both event-based and thread-based concurrency, but unlike their work, ours transparently benefits existing Haskell applications. In the context of the Java Virtual Machine, Haller and Odersky unify event- and thread-based concurrency via a Scala implementation of the actor concurrency model [1]. Much of their work is concerned with safely implementing lightweight threads via continuations on top of Java's platform-level threads, resulting in an environment similar to the two-level threading of GHC's runtime, with comparable concurrency management facilities. For several years, C programmers concerned with client concurrency have enjoyed the libev and libevent libraries. These enable an event- and callback-driven style of development that can achieve high levels of both performance and concurrency. Similar frameworks are available in other languages, e.g. Twisted for Python and Node.js for Javascript.

4. Shortcomings of the traditional I/O manager

Although the I/O manager in versions of GHC up to 6.12 is portable, stable, and performs well for low-concurrency applications, its imperfections make it inapplicable to the scale of operations required by modern networked applications. The I/O manager uses the venerable select system call for two purposes. It informs the OS of the resources it wishes to track for events, and of the time until the next pending timeout should be triggered, and it blocks until either an event occurs or the timeout fires. The select system call has well-known problems. Most obvious is the distressingly small fixed limit on the number of resources it can handle, even under modern operating systems, e.g. 1,024 on Linux. In addition, the programming style enforced by select can be inefficient. The sizes of its programmer-visible data structures are linear in the number of resources to watch. They must be filled out, copied twice across the user/kernel address space boundary, and checked afresh for every invocation. Since the common case for server-side applications on the public Internet is for most connections to be idle, the amount of useful work performed per call to select dwindles as the number of open connections increases. This repetitious book-keeping rapidly becomes a noticeable source of overhead.

The I/O manager incurs further inefficiency by using ordinary Haskell lists to manage both events and timeouts. It has to walk the list of timeouts once per iteration of its main loop, to figure out whether any threads must be woken and when the next timeout expires. It must walk the list of events twice per iteration: once to fill out the data structures to pass to select, and again after select has returned, to see which threads to wake. Since select imposes such a small limit on the number of resources it can manage, we cannot easily illustrate the cost of using lists to manage events, but in section 9.2 we will demonstrate the clear importance of using a more efficient data structure for managing timeouts.

5. Our approach

When we set out to improve the performance of GHC's I/O manager, our primary goal was to increase the number of files, network connections, and timeouts GHC could manage by several orders of magnitude. We wanted to achieve this in the framework of the existing Concurrent Haskell model, retaining complete source-level compatibility with existing Haskell code, and in a manner that could be integrated into the main GHC distribution with minimal effort. Secondarily, we wanted to sidestep the long dispute over whether events or threads make a better programming model for high-concurrency servers [11]. Since we needed to implement an event-driven I/O event manager in order to provide synchronous semantics to application programmers, we might as well design the event API cleanly and expose it publicly to those programmers who wish to use events. (In our experience, even in a language with first-class closures and continuations, writing applications of anything beyond modest size in an event-driven style is painful.)

We desired to implement as much as possible of the new I/O event manager in Haskell, rather than delegating to a lower-level language. This wish was partly borne out of pragmatism: we initially thought that it might be more efficient to build on a portable event handling library such as libev or libevent2, but experimentation convinced us that the overhead involved was too high. With performance and aesthetics pushing us in the same direction, we were happy to forge ahead in Haskell.

Architecturally, our new I/O event manager consists of two components. Our event notification library provides a clean and portable API, and abstracts the system-level mechanisms used to provide efficient event notifications (kqueue, epoll, and poll). We have also written a shim that implements the semi-public threadWaitRead and threadWaitWrite interfaces. This means that neither the core file or networking libraries, nor other low-level I/O libraries, require any changes to work with our new code, and they transparently benefit from its performance improvements.

6. Interface to the I/O event manager

Our I/O event manager is divided into a portable front end and a platform-specific back end. The interface to the back end is simple, and is only visible to the front end; it is abstract in the public interface.
    data Backend = forall a. Backend {
        -- State specific to this platform.
        _beState :: !a,

        -- Poll the back end for new events. The callback
        -- provided is invoked once per file descriptor with
        -- new events.
        _bePoll :: a
                -> Timeout                   -- in milliseconds
                -> (Fd -> Events -> IO ())   -- I/O callback
                -> IO (),

        -- Register, modify, or unregister interest in the
        -- given events on the specified file descriptor.
        _beModifyFd :: a
                    -> Fd      -- file descriptor
                    -> Events  -- old events to watch for
                    -> Events  -- new events to watch for
                    -> IO (),

        -- Clean up platform-specific state upon destruction.
        _beDestroy :: a -> IO ()
      }

A particular back end will provide a new action that fills out a Backend structure. For instance, the Mac OS X back end starts out as follows:

    module System.Event.KQueue (new) where

    new :: IO Backend

On a Unix-influenced platform, typically more than one back end will be available. For instance, on Linux, epoll is the most efficient back end, but select and poll are available. On Mac OS X, kqueue is usually preferred, but again select and poll are also available. Our public API thus provides a default back end, but allows a specific back end to be used (e.g. for testing).

    -- Construct the fastest back end for this platform.
    newDefaultBackend :: IO Backend

    newWith :: Backend -> IO EventManager

    new :: IO EventManager
    new = newWith =<< newDefaultBackend

For low-level event-driven applications, a typical event loop involves running a single step through the I/O event manager to check for new events, handling them, doing some other work, and repeating. Our interface to the I/O event manager supports this approach.

    init :: EventManager -> IO ()

    -- Returns an indication of whether the I/O event manager
    -- should continue, and a modified timeout queue.
    step :: EventManager
         -> TimeoutQueue  -- current pending timeouts
         -> IO (Bool, TimeoutQueue)

To register for notification of events on a file descriptor, clients use the registerFd function.

    -- A set of events to wait for.
    newtype Events
    instance Monoid Events

    evtRead, evtWrite :: Events

    -- A synchronous callback into the application.
    type IOCallback = FdKey -> Events -> IO ()

    -- Cookie describing an event registration.
    data FdKey

    registerFd :: EventManager
               -> IOCallback  -- callback to invoke
               -> Fd          -- file descriptor of interest
               -> Events      -- events to listen for
               -> IO FdKey

Because the I/O event manager has to accommodate being invoked from other threads as well as from the same thread in which it is running, registerFd wakes the I/O manager thread when invoked. A client remains registered for notifications until it explicitly drops its registration, and is thus called back on every step into the I/O event manager as long as an event remains pending. We find this level-triggered approach to event notification easier for client applications to use than edge triggering.

    unregisterFd :: EventManager -> FdKey -> IO ()
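Putting the pieces together, here is a hedged sketch of a one-shot read watcher, built only from the API shown above (the released code may differ in detail):

    -- Register interest in readability, and drop the registration
    -- the first time the callback fires. The interface is
    -- level-triggered, so without unregistering we would be called
    -- back on every step while data remains unread.
    watchOnce :: EventManager -> Fd -> IO () -> IO FdKey
    watchOnce mgr fd action = registerFd mgr callback fd evtRead
      where
        callback key _events = do
          unregisterFd mgr key
          action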
7. Implementation

By and large, the story of our efforts revolves around appropriate choices of data structure, with a few extra dashes of context-sensitive and profile-driven optimization thrown in.

7.1 Economical I/O event management

GHC's original I/O manager has to walk the entire list of blocked clients once per loop before calling select, and mutate the list afterwards to wake and filter out any clients that have pending events. A step through the I/O manager's loop thus involves O(n) traversal and mutation, where n is the number of clients. Our new I/O event manager registers file descriptors persistently with the operating system, using epoll on Linux and kqueue on Mac OS X, so the I/O event manager no longer needs to walk through all clients on each step through the loop. Instead, we maintain a finite map from file descriptor to client, which we look up for each triggered event. This map is based on Leijen's implementation of Okasaki and Gill's purely functional Patricia tree [7]. The new I/O event manager's loop thus involves O(m log n) traversal, and negligible mutation, where m is the number of clients with events pending. This works well in the typical case where m ≪ n.
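A toy model of that dispatch structure (our illustration, not the paper's code), using Data.IntMap (GHC's implementation of the same Okasaki and Gill Patricia tree) to map file descriptors to callbacks:

    import qualified Data.IntMap as IM

    type Callbacks = IM.IntMap (IO ())

    -- Only the m descriptors the OS reported ready are looked up,
    -- for O(m log n) work per step, instead of touching all n clients.
    dispatch :: Callbacks -> [Int] -> IO ()
    dispatch cbs ready = mapM_ run ready
      where
        run fd = case IM.lookup fd cbs of
                   Just cb -> cb
                   Nothing -> return ()  -- registration already dropped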
7.2 Cheap timeouts

In the original I/O manager, GHC maintains pending timeouts in an ordered list, which it partly walks and mutates on every iteration. Inserting a new timeout thus has O(n) cost per operation, as does each step through the I/O manager's loop. The I/O event manager needs to perform two operations efficiently during every step: remove all timeouts that have expired, and find the next timeout to wait for. Since we need both efficient update by key and efficient access to the minimum value, we use a priority search queue. Ours is based on that of Hinze [4], so insertion and deletion have O(log n) cost. A step through our new loop has O(m log n) cost, where m is the number of expired timeouts (typically m ≪ n, so we win on performance).
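A toy model of those two operations (ours; Data.Map is a stand-in for Hinze's priority search queue, which additionally supports efficient update keyed by a unique identifier):

    import qualified Data.Map as M

    -- Deadline mapped to the wakeup action to run when it expires.
    type TimeQueue = M.Map Double (IO ())

    -- Remove and return all timeouts that have expired by 'now'.
    expire :: Double -> TimeQueue -> (TimeQueue, [IO ()])
    expire now q = (rest, M.elems ready)
      where (ready, rest) = M.partitionWithKey (\t _ -> t <= now) q

    -- The next deadline, to hand to the back end's poll call.
    nextDeadline :: TimeQueue -> Maybe Double
    nextDeadline q
      | M.null q  = Nothing
      | otherwise = Just (fst (M.findMin q))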
8. War stories, lessons learned, and scars earned

Writing fast networking code is tricky business. We have variously encountered:

• Tunable kernel variables (15 at the last count) that regulate obscure aspects of the networking stack in ways that are important at scale;

• Abstruse kernel infelicities (e.g. Mac OS X lacking the NOTE_EOF argument to kqueue, even though it has been present in other BSD variants since 2003);
• Performance bottlenecks in GHC that required expert diagnosis (section 8.2);

• An inability to stress the software enough, due to lack of 10-gigabit Ethernet hardware (gigabit Ethernet is easily saturated, even with obsolete hardware).

In spite of these difficulties, we are satisfied with the performance we have achieved to date. To give a more nuanced flavour of the sorts of problems we encountered, we have chosen to share a few in more detail.

8.1 Efficiently waking the I/O event manager

In a concurrent application with many threads, the I/O event manager thread spends much of its time blocked, waiting for the operating system to notify it of pending events. A thread that needs to block until it can perform I/O has no way to tell how long the I/O event manager thread may sleep for, so it must wake the I/O event manager in order to ensure that its I/O request can be queued promptly.

The original implementation of event manager wakeup in GHC uses a Unix pipe, which clients use to transmit one of several kinds of single-byte control message to the I/O event manager thread. The delivery of a control message has the side effect of waking the I/O event manager if it is blocked. Because a variety of control message types exist, the original event manager reads and inspects a single byte from the pipe at a time. If several clients attempt to wake the event manager thread before it can service any of their requests, it acts as if it has been woken several times in succession, potentially performing unneeded work. More damagingly, this design is vulnerable to the control pipe filling up, since a Unix pipe has a fixed-size buffer. If control messages are lost due to a pipe overflow, an application may deadlock. (Indeed, one of our microbenchmarks inadvertently provided a demonstration of how easy it was to provoke a deadlock under heavy load!)

As a result, we invested some effort in ameliorating the problem. Our principal observation was that by far the most common control message is a simple "wake up." We have accordingly special-cased the handling of this message. On Linux, when possible, we use the kernel's eventfd facility to provide fast wakeups. No matter how many clients send wakeup requests in between checks by the I/O event manager, it will receive only one notification. While other operating systems do not provide a comparably fast facility, we still have a trick up our sleeves. We dedicate a pipe to delivering only wakeup messages. To issue a wakeup request, a client writes a single byte to this pipe. When the I/O event manager is notified that data is available on this pipe, it issues a single read system call to gather all currently buffered wakeups. It does not need to inspect any of the data it has read, since they must all be wakeups, and the fixed size of the pipe buffer guarantees that it will not be subject to unnecessary wakeups, regardless of the number of clients requesting them. This means that we no longer need to worry about wakeup messages that cannot be written for want of buffer space, so the thread doing the waking can safely use a non-blocking write.
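A hedged sketch of the draining side of that trick (ours, under the assumption that the event manager only calls this after being notified the pipe is readable):

    import Foreign.Marshal.Alloc (allocaBytes)
    import System.Posix.IO (fdReadBuf)
    import System.Posix.Types (Fd)

    -- Gather every buffered wakeup byte with one read; the bytes'
    -- contents are irrelevant, only the act of reading matters.
    drainWakeups :: Fd -> IO ()
    drainWakeups fd = allocaBytes 4096 $ \buf -> do
      _ <- fdReadBuf fd buf 4096
      return ()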
8.2 The great black hole pileup

Our use of an IORef to manage the timeout queue yielded a problem that was especially difficult to diagnose, with a symptom of programs unpredictably running thousands of times slower. In our threadDelay benchmark, thousands of threads compete to update the single timeout management IORef atomically. If one of these threads was pre-empted while evaluating the thunk left in the IORef by atomicModifyIORef, the thunk would become a "black hole," i.e. a closure that is being evaluated. From that point on, all the other threads would become blocked on black holes: as each thread called atomicModifyIORef and found a black hole inside, it would deposit a new black hole that depended on its predecessor. A black hole is a special kind of thunk that is invisible to applications, so we could not play any of the usual seq tricks to jolly evaluation along.

When we encountered this problem, the black hole queue was implemented as a global linear list, which was scanned during every GC. Most of the time, this choice of data structure was not a problem, but it became painful with thousands of threads. In response, Simon Marlow performed a wholesale replacement of GHC's black hole mechanism. Instead of a single global black hole queue, GHC now queues a blocked thread against the closure upon which it is blocking. His work has fixed our problem.

8.3 Bunfight at the GC corral

When a client application registers a new timeout, we must update the data structure that we use to manage timeouts. Originally, we stored the priority search queue inside an IORef, and each client manipulated the queue using atomicModifyIORef. Alas, this led to a bad interaction with GHC's generational garbage collector. Since our client-side use of atomicModifyIORef did not force the evaluation of the data inside the IORef, the IORef would accumulate a chain of thunks. If the I/O event manager thread did not evaluate those thunks promptly enough, they would be promoted to the old generation and become roots for all subsequent minor garbage collections (GCs). When the thunks eventually got evaluated, they would each create a new intermediate queue that immediately became garbage. Since the thunks served as roots until the next major GC, these intermediate queues would get copied unnecessarily in the next minor GC, increasing GC time. We had created a classic instance of the generational "floating garbage" problem.

The effect of the floating garbage problem on performance was substantial. For example, with 20,000 threads sleeping, we saw variations in our threadDelay microbenchmark performance of up to 34%, depending on how we tuned the GC and whether we simply got lucky. We addressed this issue by having clients store a list of edits to the queue, instead of manipulating it directly.

    type TimeoutEdit = TimeoutQueue -> TimeoutQueue

While maintaining a list of edits doesn't eliminate the creation of floating garbage, it reduces the amount of copying at each minor GC enough that these substantial slowdowns no longer occur.
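A hedged sketch of that fix (TimeoutQueue is the abstract queue type from above; the helper names are ours): clients cons an edit closure onto a list, which is O(1) and builds no thunk chain over the queue itself, and the manager thread later drains and applies the edits.

    import Data.IORef

    -- Client side: cheap, contention-friendly update.
    postEdit :: IORef [TimeoutEdit] -> TimeoutEdit -> IO ()
    postEdit ref edit = atomicModifyIORef ref (\edits -> (edit : edits, ()))

    -- Manager side: drain the pending edits and apply them, oldest
    -- first. The list is newest-first, so the right-fold composition
    -- applies the last element (the oldest edit) innermost.
    applyEdits :: IORef [TimeoutEdit] -> TimeoutQueue -> IO TimeoutQueue
    applyEdits ref q = do
      edits <- atomicModifyIORef ref (\pending -> ([], pending))
      return $! foldr (.) id edits q  -- force the queue's root promptly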
9. Empirical results

We gathered Linux results on commodity quad-core server-class hardware with 2.66GHz Intel Xeon X3230 CPUs and 4GB of RAM, running 64-bit Debian 4.0. We used version 6.12.1 of GHC for all measurements, running server applications on three cores with GHC's parallel garbage collector disabled. (The first release of the parallel GC performed poorly on loosely coupled concurrent applications; this problem has since been fixed.) When measuring network application performance, we used an idle gigabit Ethernet network.

9.1 Performance of event notification

To evaluate the raw performance of event notification, we wrote two HTTP servers. Each uses the usual Haskell networking libraries, and we compiled each against both the original I/O manager (labeled "(old)" in graphs) and our rewrite (labeled "(new)"). The first, pong, simply responds immediately to any HTTP request with a response of "Pong!". The second, file, opens and serves the contents of a 4,332-byte file. We used the ApacheBench tool to measure performance while varying client concurrency.

In figure 1, all client connections are active simultaneously; none are idle. Under these conditions of peak load, the epoll back end exhibits throughput and latency comparable to the original I/O manager. Notably, the new I/O event manager handles far more concurrent connections than the 1,016 or so that the original I/O manager is capable of.

To create a workload that corresponds more closely to conditions for real applications, we open a variable number of idle connections to the server, then measure the performance of a series of requests where we always use 64 concurrently active clients. Figure 2 illustrates the effects on throughput and latency of the pong microbenchmark when we vary the number of idle clients. For completeness, we measured the performance of both the epoll and poll back ends. The original and epoll managers show similar performance up to the 1,024 limit that select can handle, but while the performance of poll is erratic, the epoll back end is solid until we have 50,000 idle connections open. (We have tested the new event manager with as many as 300,000 idle client connections.) In general, the small limit that select imposes on the number of concurrently managed resources prevents us from seeing any interesting changes in the behaviour of the original I/O manager, because applications fall over long before any curves have an opportunity to change shape. We find this disappointing, as we were looking forward to a fair fight.
Figure 1. Requests served per second (top) and latency per request (bottom) for the two HTTP server benchmarks (pong and file), with all clients busy, under old and new I/O managers. (Both panels plot against the number of concurrent active clients, on a logarithmic scale.)

Figure 2. Requests served per second (top) and latency per request (bottom), with 64 active connections and varying numbers of idle connections. (Both panels plot against the number of concurrent idle clients, comparing pong under the old manager and the new manager's epoll and poll back ends.)

9.2 Performance of timeout management

We developed a simple microbenchmark to measure the performance of the threadDelay function, and hence the efficiency of the timeout management code. We measured its execution time, with the runtime system set to use two OS-level threads. As the upper graph of figure 3 indicates, GHC's traditional I/O manager exhibits O(n²) behaviour when managing numerous timeouts. In comparison, the lower graph of figure 3 shows that the new timeout management code has no problem coping with millions of simultaneously active timeouts. The performance of our microbenchmark did not begin to degrade until we had three million threads and timeouts active on a system with 4GB of RAM. Even for smaller numbers of threads, the new timeout management code is far more efficient than the old, as figure 4 shows.

Figure 3. Performance of the threadDelay benchmark, run under the existing I/O event manager (top) and our rewritten manager (bottom); both panels plot execution time (secs) against thousands of running threads.

Figure 4. Comparative performance of old and new I/O managers on the threadDelay microbenchmark. Note the logarithmic scale on the y-axis, needed to make the numbers for the new manager distinguishable from zero.
10. Future work

We have integrated our event management code into GHC, and it will be available to all applications as of GHC 6.14. Our future efforts will revolve around Windows support and further performance improvements.

10.1 Windows support

As we are primarily Unix developers, our work to date leaves GHC's event management on Windows unchanged. We believe that our design can accommodate the Windows model of scalable event notification via I/O completion ports.

10.2 Lower overhead

We were a little surprised that epoll is consistently slightly slower than select. This might be in part because we currently issue two epoll_ctl system calls per event notification: one to queue it with the kernel, and one to dequeue it. In contrast, the original I/O manager performs none. If we used epoll in edge-triggered mode, we could eliminate one call to epoll_ctl to dequeue an event.
10.3 Improved scaling to multiple cores

In theory, an application should be able to improve both throughput and latency by distributing its event management load across multiple cores. We already support running many instances of the low-level I/O event manager at once, with each instance managing a disjoint set of files or network connections. We hope to create a benchmark that stresses the I/O event manager in such a way that we can either find bottlenecks in, or demonstrate a performance improvement via, multicore scaling.
Acknowledgments We owe especial gratitude to Simon Marlow for his numerous detailed conversations about performance, and for his heroic fixes to GHC borne of the tricky problems we encountered. We would also like to thank Brian Lewis and Gregory Collins for their early contributions to the new event code base.
Figure 3. Performance of the threadDelay benchmark, run under the existing I/O event manager (top) and our rewritten manager (bottom).
Execution time (secs)
Additional materials
The source code of the original, standalone version of our event management library and our benchmarks are available at http://github.com/tibbe/event . 0
References
100 old new
10
[1] P. Haller and M. Odersky. Actors that unify threads and events. In Proceedings of the International Conference on Coordination Models and Languages, 2007. [2] T. Harris, S. Marlow, and S. Peyton Jones. Haskell on a sharedmemory multiprocessor. In Haskell ’05: Proceedings of the 2005 ACM SIGPLAN workshop on Haskell, pages 49–61. [3] T. Harris, S. Marlow, S. Peyton Jones, and M. Herlihy. Composable memory transactions. In PPoPP ’05: Proceedings of the tenth ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, pages 48–60.
1 0.1 0.01 0
10
20
30
40
50
60
Thousands of running threads
[4] R. Hinze. A simple implementation technique for priority search queues. In Proceedings of the 2001 International Conference on Functional Programming, pages 110–121.
Figure 4. Comparative performance of old and new I/O managers on the threadDelay microbenchmark. Note the logarithmic scale on the y-axis, needed to make the numbers for the new manager distinguishable from zero.
[5] D. Jones Jr., S. Marlow, and S. Singh. Parallel performance tuning for Haskell. In Proceedings of the 2009 Haskell Symposium. [6] S. Marlow, S. Peyton Jones, and W. Thaller. Extending the Haskell foreign function interface with concurrency. In Haskell ’04: Proceedings of the ACM SIGPLAN workshop on Haskell, pages 57–68. URL http: //www.haskell.org/~simonmar/papers/conc-ffi.pdf. [7] C. Okasaki and A. Gill. Fast mergeable integer maps. In Workshop on ML, pages 77–86, 1998. [8] B. O’Sullivan. Criterion, a new benchmarking library for Haskell. http://bit.ly/rUuAa, 2009. [9] L. Peng and S. Zdancewic. Combining events and threads for scalable network services. In PLDI ’07: Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 189–199. [10] S. Peyton Jones, A. Gordon, and S. Finne. Concurrent Haskell. In POPL ’96: Proceedings of the 1996 Annual Symposium on Principles of Programming Languages, pages 295–308.
two epoll ctl system calls per event notification: one to queue it with the kernel, and one to dequeue it. In contrast, the original I/O manager performs none. If we used epoll in edge-triggered mode, we could eliminate one call to epoll ctl to dequeue an event6 . 10.3
Better performance tools
When we were diagnosing performance problems with the I/O event manager, we made heavy use of existing tools, such as the Criterion benchmarking library [8], GHC’s profiling tools, and the ThreadScope event tracing and visualisation tool [5]. As useful as those tools are, when we made our brief foray into multicore event dispatching, we lacked data that could help us to pin down any performance bottleneck. If we could integrate the new Linux perf analysis tools with ThreadScope, we might gain a broader systemic perspective on where performance problems are occurring.
Improved scaling to multiple cores
In theory, an application should be able to improve both throughput and latency by distributing its event management load across multiple cores. We already support running many instances of the low-level I/O event manager at once, with each instance managing a disjoint set of files or network connections. 6 As
a side note, the BSD kqueue mechanism is cleaner than epoll in this one respect, combining queueing, dequeueing, and checking for multiple events into a single system call. However, the smaller number of trips across the user/kernel address space boundary does not appear to result in better performance, and the kqueue mechanism is otherwise more cluttered and difficult to use than epoll.
[11] R. von Behren, J. Condit, and E. Brewer. Why events are a bad idea (for high-concurrency servers). In HotOS IX: 9th Workshop on Hot Topics in Operating Systems, 2003.
108
An LLVM Backend for GHC

David A. Terei        Manuel M. T. Chakravarty
University of New South Wales
{davidt,chak}@cse.unsw.edu.au
Abstract

In the presence of ever-changing computer architectures, high-quality optimising compiler backends are moving targets that require specialist knowledge and sophisticated algorithms. In this paper, we explore a new backend for the Glasgow Haskell Compiler (GHC) that leverages the Low Level Virtual Machine (LLVM), a new breed of compiler written explicitly for use by other compiler writers, not high-level programmers, that promises to enable outsourcing of low-level and architecture-dependent aspects of code generation. We discuss the conceptual challenges and our backend design. We also provide an extensive quantitative evaluation of the performance of the backend and of the code it produces.

Categories and Subject Descriptors: D.3.2 [Language Classifications]: Applicative (functional) languages; D.3.4 [Processors]: Code generation, Retargetable compilers

General Terms: Design, Languages, Performance

1. Introduction

The Glasgow Haskell Compiler (GHC) began with a backend translating code for the Spineless Tagless G-machine (STG-machine) to C [23]. The idea was that targeting C would make the compiler portable due to the ubiquity of C compilers. At the time, this was a popular approach [7, 14, 27]. By leveraging C as a portable assembly language, the authors of compilers for higher-level languages hoped to save development effort, reuse the engineering work invested into the backends of optimising C compilers, and achieve portability across multiple architectures and operating systems.

Unfortunately, it turned out that C is a less than ideal intermediate language, especially for compilers of lazy functional languages with their non-standard control flow [24]. In particular, C does not support proper tail calls, first-class labels, access to the runtime stack for garbage collection, and many other desirable features. This is not surprising, as C was never designed to act as an intermediate language for high-level compilers. Nevertheless, it complicates the work of compiler writers, as they have to work around those limitations of C-based backends. Moreover, the resulting machine code is less efficient than that of backends which generate native assembly code.

GHC and other high-level compilers partially mitigate these shortcomings by targeting the GNU C compiler and using some of its many language extensions, such as global registers and first-class labels. This doesn't detract too much from the portability of C, as GNU C itself has been ported to many architectures. However, exploiting GNU C extensions only partially solves the problems of compiling via C, as C compiler optimisations are often ineffective for code generated from high-level languages: much static information gets lost in the translation. Since the exact set of supported extensions depends on the particular version of GNU C, this approach also introduces new dependencies. In the case of GHC, the use of the GNU C compiler as a backend also increases compilation time significantly. As a response, GHC eventually included support for native code generators that directly produce assembly code. These are currently only fully functional for the x86 and SPARC architectures.

The desire to reap the benefits of compiling via C, while avoiding the problems, inspired the development of low-level intermediate languages that can be conveniently targeted by high-level compilers and that provide the basis for code generators shared by multiple compilers. Of particular interest is C--, as its design has been heavily influenced by the experience with GHC [24]. Although using C-- as an intermediate language is technically a very promising approach, it comes with a huge practical problem: it is only worthwhile to develop a portable compiler backend if it is used by many compilers, but compiler writers do not want to commit to a backend technology unless they know it is widely used and supported. As a consequence, a variant of C-- is currently used as a low-level intermediate language in GHC, but there is no useful general-purpose backend based on C-- that GHC could target.

Currently, the most promising backend framework is the Low Level Virtual Machine (LLVM), which comes with support for just-in-time compilation and life-long program analysis and optimisation [19]. An LLVM-based C compiler, named clang, recently gained significant traction as an alternative to the GNU C compiler.¹ Hence, it is very likely that LLVM will be further developed and is a suitable target for a long-term strategy.

In this paper, we describe the design of a new GHC backend that leverages LLVM. We illustrate the problems that we encountered, such as conflicting register conventions and GHC's tables-next-to-code optimisation, and our approach to solving them. We also present a quantitative analysis of both the performance of the backend itself and of the code it produces. In particular, we compare it to the C backend and the native code generator of GHC. The overall outcome is very promising: the new LLVM backend matches the performance of the existing backends on most code and outperforms the existing backends by up to a factor of 2.8 on tight loops with high register pressure on the x86 architecture. In summary, we make the following technical contributions:

• We qualitatively compare GHC's existing backends and the capabilities of the LLVM framework (Sections 3 & 4).
• We present a design for an LLVM backend for GHC, including new methods to solve long-standing problems, such as GHC's fixed register assignment and tables-next-to-code optimisation (Section 5).
• We present a quantitative analysis of the performance of the two old backends and our new LLVM backend (Section 6).

We discuss related work in Section 7 and conclude in Section 8.

¹ To a large part due to the backing and financial support of Apple.
2. The case for a new backend
As we outlined in the previous section, GHC used to have two backends: (1) the C backend, which generates GNU C code, and (2) the native code generator, which generates assembly code for a few architectures. We will briefly review these two backends before giving our motivation for developing a third, the LLVM backend.
2.1 The C backend
The C backend is based on the STG-machine, an abstract machine that was designed to support the compilation of higher-order, lazy functional languages [23]. GHC's C backend generates C code which contains extensions that are specific to the GNU C compiler. These extensions facilitate the generation of more efficient code by storing the virtual machine registers of the STG-machine in concrete registers of the target hardware, by supporting tail calls with first-class labels, and by removing some overhead due to superfluous function entry and exit code.

The resulting dependence on GNU C has two major drawbacks. Firstly, portability is limited by the portability of GNU C (it is not very well supported on Microsoft Windows, for example), and even on architectures that are supported by GNU C, the generated code can be of poor quality, as is, for example, the code produced by the GNU C backend for the SPARC architecture. Secondly, as GHC exploits not only the extensions but also the particular form of assembly generated, there are also dependencies on the version of the C compiler. Therefore, changes in the code generator of the GNU C compiler can break, and have in the past broken, GHC.

The GHC C code generator consists of a reasonably manageable 5,400 lines of code. However, to achieve even better efficiency than is possible by exploiting GNU C extensions alone, GHC opts to post-process the assembly generated by the C compiler. More precisely, it uses a Perl script to match specific patterns of assembly code and to rewrite them to better-optimised assembly code. This script is of course heavily dependent on the target architecture and also on the specific version of GNU C. In particular, it rearranges code blocks to implement the tables-next-to-code scheme of GHC, which we will discuss in more detail in Section 3.5. For obvious reasons, this script is hard to maintain (it is not a coincidence that it is nicknamed "the evil mangler"), and it has been responsible for more than one tricky bug.

Finally, another serious shortcoming of the C backend is its relatively long compilation time. GHC generates sizeable C files, and GNU C requires considerable time to turn them into assembly code. Unfortunately, the long compilation time does not result in carefully optimised code, as one might hope. On the contrary, the generated assembly leaves much to be desired. This is not so much the fault of the GNU C compiler as a consequence of GHC generating non-idiomatic C code.
2.2 The native code generator

GHC's native code generator (NCG) was developed to avoid the problems of the C backend. It directly generates assembly code. Just as with C code generation, imperative code is generated from a representation of a Haskell program in the language of the STG-machine. As a result, and because appropriate care is taken, code generated by the NCG is binary compatible with code generated by the C backend.

GHC's native code generator shares the usual advantages and disadvantages of a backend that produces assembly. It can generate efficient code, without any of the tricks used by the C backend, such as post-processing the assembly. Implementing an NCG, however, requires detailed knowledge of the target architecture and a considerable amount of work, much of which is replicated for each platform GHC supports. It is also quite difficult to implement code generators that generate high-quality code, as this requires many optimisations and fine-tuning for each architecture to achieve optimal register and instruction selection. With the NCG, each platform, such as x86, SPARC, and PowerPC, has to perform its own code generation, with only the register allocator being shared among all platforms. As a result, the NCG, which includes three code generators (for x86, SPARC, and PowerPC), is, at about 20,570 lines of code, nearly four times the size of the C backend. One of the main advantages of the NCG is that it can generally compile a Haskell program in half the time of the C backend. Also, despite its far larger size compared to the C backend, it is arguably the simpler of the two.

2.3 The LLVM backend

A GHC backend on the basis of a portable compiler framework, such as the Low Level Virtual Machine (LLVM) [19], promises to combine the benefits of the C backend and the NCG with few or none of their disadvantages. The idea behind the C backend was to outsource the considerable development and maintenance effort required to implement a compiler backend to the developers of C compilers; after all, this is an area that is fairly far away from where GHC innovates. Compared to the NCG and C backend, LLVM has the following to offer:

Offloading of work. Building a high-performance compiler backend is a huge amount of work; LLVM, for example, was started around 10 years ago. Going forward, GHC's LLVM backend should be a lot less work to maintain and extend than either the C backend or the NCG. It will also benefit from any future improvements to LLVM, which has a particularly bright-looking future given the community and industrial support behind it [28].

Optimisation passes. GHC does a great job of producing fast Haskell programs. However, there are a large number of lower-level optimisations, particularly the kind that require machine-specific knowledge, that it doesn't currently implement. Some examples of these include partial redundancy elimination (PRE), loop unrolling and strength reduction. Through LLVM we gain these and many more for free.

The LLVM framework. Perhaps the most appealing feature of LLVM is that it has been designed from the start to be a compiler framework. Individual optimisations can be chosen and ordered at compile time, and new optimisation passes can be dynamically loaded. LLVM also offers a choice of register allocators and even an interpreter and JIT compiler. For a research-driven project like GHC this is a great benefit and makes LLVM a very fun and useful tool to experiment with.

The community. The LLVM project now includes far more than LLVM itself: it is an entire compiler tool chain with a C/C++ compiler, assembly tools, a linker, a debugger, and static analysis tools. The work of the community on these projects, and also the work of third-party compilers such as GHC, benefits all compilers based on LLVM and improves tool support.
3. How GHC works
Before discussing LLVM and how it fits into GHC’s compilation process in Sections 4 and 5, this section details the aspects of GHC’s design that are relevant to the LLVM backend.
Figure 1. The GHC pipeline

3.1 The GHC pipeline

Figure 1 outlines GHC's compilation pipeline. The three main intermediate languages in that pipeline are:

Core. GHC's main optimisation engine is implemented in the form of a large number of program analysis and source-to-source transformation steps on the intermediate language Core. Core is a typed lambda calculus, specifically an instance of System FC(X) [26], and as such far removed from the process of target code generation. Hence, it plays no role in the design of the LLVM backend. Nevertheless, it is central to one of the areas of major innovation in GHC, which highlights our previous point that target code generation is essentially an unwelcome distraction to most GHC developers.

STG-language. This is an A-normalised [12] lambda calculus, which serves as the language of the Spineless Tagless G-Machine (STG-machine) [23], the abstract machine that defines GHC's execution model. This execution model was originally the basis of the C backend, and hence strongly impacts a number of the design choices in the target code generation. We will discuss the STG-machine and its impact on code generation in more detail in the following subsection.

Cmm. Cmm is a variant of the C-- language [24], which in turn could be described as a subset of C with extensions to facilitate its use as a low-level intermediate language; for example, it supports tail calls and integration with a garbage collector. As most of the more complex features of C--, such as its runtime system, aren't supported in Cmm, it is even closer to being a subset of C. In fact, before GHC included Cmm, it used an intermediate language called Abstract C instead. Cmm is the starting point for the two original code generators, the C backend and the NCG, and it also serves as the input to our LLVM backend. It is central to developing a GHC backend and we will discuss it in detail in Subsection 3.4.

The dependence of GHC's code generators on Cmm is obvious in Figure 1, where the pipeline splits after Cmm depending on which backend is used. The NCG generates assembly directly from Cmm, whereas the C backend and the new LLVM backend generate C and LLVM IR, respectively. The generation of assembly from C and LLVM IR is then left to supporting tools, namely the GNU C compiler and the LLVM tools, respectively. The C backend additionally runs a post-processing tool, a Perl script with target code-specific regular expressions, over the assembly to further optimise it. We will discuss the exact reasons for starting from Cmm in the LLVM backend in detail in Section 5. Before we can do this, we first need to introduce some of the design of the STG-machine and the Cmm intermediate language.

3.2 Spineless Tagless G-Machine

The STG-machine essentially comprises three parts: (1) the STG-language from Figure 1, (2) an abstract machine configuration consisting of a register set, heap, stack, etc., and (3) an operational semantics that defines in which way the various constructs of the STG-language alter the machine configuration upon execution. The first component, the STG-language itself, is not important for the LLVM backend, as we translate the lower-level Cmm to LLVM IR. However, the remaining two components, the machine configuration and the operational semantics, are crucial to understanding the LLVM backend, as it is the ultimate purpose of the backend to map these two components onto the target machine architecture, or, more precisely, onto the LLVM machine configuration and LLVM IR code, respectively. In theory, it is LLVM's responsibility to map STG-configurations and programs encoded in LLVM IR to the various concrete architectures. In practice, the design of the LLVM backend requires us to understand how LLVM IR maps to concrete architectures to generate efficient code. In particular, we need to represent the heap, stack, and machine registers of the STG-machine on top of LLVM. As Cmm is specialised to GHC and the translation of STG-language programs, the Cmm code follows certain idioms and includes specific language constructs to handle the components of the STG machine configuration. Of particular importance is the register set of the STG-machine, which we will call STG registers in the following.

3.3 STG Registers

Abstract machines usually define the most frequently accessed components of their machine state as abstract machine registers, to suggest that these are mapped to hardware registers for optimal performance. In the case of GHC, these abstract machine registers are also central for the interaction with the runtime system (RTS), which is written in C, Cmm, and some snippets of assembly. The STG registers function as an interface between generated code and the runtime system. In other words, the mapping of STG registers to the hardware registers and memory locations of the target architecture is hard-wired into the runtime system. The STG registers include a stack and heap pointer, as well as a set of general registers that are used for argument passing.

Currently, there are two different ways in which GHC implements STG registers; they are called unregisterised and registerised mode, respectively. Unregisterised mode is the simple approach where all STG registers are stored in memory on the heap. Due to the frequent use of STG registers in GHC-generated code, this simple approach comes with a significant performance penalty and is mainly meant for porting and bootstrapping GHC on a new architecture. In unregisterised mode, GHC's C backend generates standard C code and omits the post-processing phase indicated in Figure 1. In contrast, registerised mode uses the hardware registers of the target architecture to store at least the most important STG registers; this process is often referred to as register pinning. As there are far too many STG machine registers to store them all in
real registers, some still need to be stored in memory. This technique alone has a dramatic effect on the speed of programs. As these registers are used by GHC's C-based runtime system, implementing the STG registers in a different manner than either of the two currently supported ones would involve significant changes to the RTS, increasing the development and maintenance effort. Hence, an appropriate mapping of the STG registers can be a considerable challenge for a backend, since it requires explicit control over register allocation, something not offered by many compiler targets, including LLVM.

3.4 Cmm

As depicted in Figure 1, Cmm is the final backend-independent intermediate representation used by GHC, and serves as a common starting point for the backend code generators. Cmm is based on the C-- language, but with numerous additions and omissions. The most important difference is that Cmm doesn't support any of C--'s runtime interface features. In C--, these features provide support for implementing accurate garbage collection and exception handling. Instead of involving Cmm, GHC uses a portable garbage collector, implemented in the runtime system, that requires no explicit support from the backends, but depends on the idiomatic generation of Cmm code by GHC. Overall, Cmm is designed to be a minimal procedural language. It supports just the features needed to efficiently abstract away hardware, and little more. The prominent features of the language are:

1. Unlimited variables, abstracting real hardware registers;
2. A simple type system of either bit types or float types;
3. A powerful label type and sections, which can be used to implement higher-level data types;
4. Functions and function calling with efficient tail call support. Functions don't support arguments, though; the STG registers are used instead to explicitly implement the calling convention;
5. Explicit control flow, with functions being comprised of blocks and branch statements;
6. Direct memory access;
7. A set of global variables that represent the STG registers; and
8. Code and data order in Cmm is preserved in the compiled code. GHC uses this property for implementing one particular optimisation, which we will examine in detail in the next subsection.

Cmm greatly simplifies the task of a backend code generator, as the non-strict and functional aspects of Haskell have already been handled and the code generators instead only need to deal with a fairly simple procedural language. Figure 2 displays an example of Cmm. It demonstrates a large portion of the Cmm language, such as its types, variables, control structures and use of code and data labels.

    section "data" {
        fibmax: bits32 35;
    }

    fib() {
        bits32 count;
        bits32 n;
        bits32 n1;
        bits32 n2;
        count = R1;
        n2 = 0;
        n1 = 1;
        n = 0;

        if (count == 1 || bits32[fibmax] < count) {
            n = 1;
            goto end;
        }

    for:
        if (count > 1) {
            count = count - 1;
            n = n2 + n1;
            n2 = n1;
            n1 = n;
            goto for;
        }

    end:
        R1 = n;
        jump StgReturn;
    }

Figure 2. Cmm example: Fibonacci numbers

3.5 Cmm data & code layout

One of the requirements Cmm places on a backend is that the generated object code has the same order of data and code sections as the Cmm code. If a data section and a code section are adjacent in the Cmm code, they are expected to be adjacent in the final object code. This is a problematic requirement, as C compilers and other tools take the liberty to reorder code and data sections. Hence, this requirement accounts for much of the magic performed by the Perl script realising the assembly post-processing for GHC's C backend. It turns out to be a problem for the LLVM backend, too.

But before getting into the details of the LLVM backend, let us review the reason for the onerous constraint of the Cmm intermediate language. GHC uses it to implement an optimisation known as tables-next-to-code (TNTC). The basic idea is to lay the metadata of a closure right before the code for the closure itself. The metadata, which we call an info-table, is required by the runtime system for each closure. With that layout, both the closure's evaluation code and its metadata can be accessed from a single pointer. A graphical representation of this layout is in Figure 3. The first word of a closure is its info pointer, which refers to the first instruction of the closure's entry code. The remaining fields of the closure, its payload, contain a function's free variables or a data constructor's arguments.

Figure 3. GHC's optimised TNTC heap layout

A closure's entry code is executed when the closure is evaluated, i.e., when a lazily evaluated piece of code, a thunk, is forced, or when a function closure is entered to apply it to some arguments. The entry code of closures representing data structures that are in normal form returns a value identifying the corresponding data constructor or similar. By indexing backwards from a closure's info pointer, the runtime system can access the info-table, which contains layout information to assist garbage collection, and other metadata.

Without the TNTC optimisation, the first word of a closure does not refer directly to the entry code, but instead to the info-table, as depicted in Figure 4. The info-table, in turn, explicitly stores a pointer to the entry code in an additional field. Hence, without the TNTC optimisation, info-tables use one more word of memory and, more importantly, executing a closure's entry code, when it is evaluated, requires two pointer lookups instead of one. This is costly, as Haskell code creates and evaluates closures at a rapid rate.

Figure 4. GHC's unoptimised heap layout

In summary, due to the frequent closure entry of Haskell code, the GHC designers chose to bake a layout constraint into Cmm that is hard to meet with conventional backend technology, such as compiling via C or using a general-purpose framework such as LLVM. We will look into the capabilities of LLVM in more detail in the following section.
4. How LLVM works

The Low Level Virtual Machine (LLVM) is an open source, mature optimising compiler framework whose development started in 2000 as part of Lattner's Masters thesis [18]. Today, it provides a high-performance static compiler backend, but can also be used to build just-in-time compilers and to provide mid-level analyses and optimisation in a compiler pipeline. Its main innovation is in the area of life-long program analysis and optimisation [19], i.e., it supports program analysis and optimisation at compile time, link time, and runtime. Our GHC LLVM backend currently ignores the link-time and runtime analysis and optimisation capabilities and uses LLVM as a conventional static backend. Hence, we are most concerned with LLVM's abstract machine language, which serves as input to the LLVM pipeline.

4.1 The LLVM assembly language

The LLVM assembly language, LLVM IR, is the input language which LLVM accepts for code generation. However, it also acts as LLVM's internal intermediate representation for program analysis and optimisation passes. The IR has three equivalent representations: a textual representation (the assembly form), an in-memory representation, and a binary representation. The textual representation is useful in a compiler pipeline where individual tools communicate via files, as well as for human inspection. The in-memory representation is used internally, but also whenever a compiler links to LLVM as a library to avoid the overhead of file input and output. The binary representation is used for compact storage; it occupies less storage than the textual format and can be read more efficiently.

The LLVM IR is low-level and assembly-like, but it maintains higher-level static information in the form of type and dataflow information, the latter due to using static single assignment (SSA) form [9]. SSA form guarantees that every variable is only assigned once (and never updated), and is hence strongly related to functional programming [6]. The design goal in combining a low-level language with high-level static information is to retain sufficient static information to enable aggressive optimisation, while still being low-level enough to efficiently support a wide variety of programming languages. The main features of LLVM's assembly language are:

1. Unlimited virtual registers, abstracting real hardware registers;
2. Low-level assembly with higher-level type information;
3. Static single assignment (SSA) form with phi (φ) functions;
4. Functions and function calling with efficient tail call support;
5. Explicit control flow, with functions comprising blocks and branch statements; and
6. Direct memory access, as well as a type-safe address calculation instruction, getelementptr, facilitating optimisations.

The single-assignment property of the SSA form requires the use of phi (φ) functions in the presence of low-level control flow with explicit branches. A phi function selects the value to be assigned to a virtual register depending on the edge of the control-flow graph along which execution reached the phi function. SSA form is well established as a type of intermediate representation that simplifies the implementation of code analysis and optimisation.

The above feature list of the LLVM IR has much in common with the feature list of Cmm (in Section 3.4). We will compare the two in detail in Section 5.1, where we discuss the translation. For the moment, let's have a look at a concrete piece of LLVM IR code.

    define i32 @pow( i32 %M, i32 %N ) {
    LoopHeader:
        br label %Loop
    Loop:
        %res  = phi i32 [1, %LoopHeader], [%res2, %Loop]
        %i    = phi i32 [0, %LoopHeader], [%i2, %Loop]
        %res2 = mul i32 %res, %M
        %i2   = add i32 %i, 1
        %cond = icmp ne i32 %i2, %N
        br i1 %cond, label %Loop, label %Exit
    Exit:
        ret i32 %res2
    }

Figure 5. LLVM code to raise an integer to a power

The code in Figure 5 contains one complete LLVM function, which is made up of a list of basic blocks, each preceded by a label. The function has three basic blocks: LoopHeader, Loop, and Exit. All control flow in LLVM is explicit, so each basic block must end with a branch (br) or return statement (ret). Variable names preceded by a percent symbol, such as %res and %i, denote virtual registers. Virtual registers are introduced by the unique assignment that defines them, just as in a let-binding. All operations are annotated with type information, such as i32, which denotes a 32-bit integer type. Finally, the Loop block starts with two phi functions. The first one assigns to %res either the constant 1 or the value stored in register %res2, depending on whether execution entered the Loop block from LoopHeader or from Loop itself.
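The remark above that SSA form is strongly related to functional programming can be made concrete: Figure 5 transliterates directly into a tail-recursive Haskell function. The following sketch is our own illustration, not code from the paper; the two phi functions become the parameters of the local loop, and every virtual register becomes an immutable binding.

    -- A transliteration of Figure 5 into Haskell (illustrative only).
    -- Each phi node becomes a parameter of the tail-recursive 'loop';
    -- each SSA virtual register becomes an immutable let-binding.
    pow :: Int -> Int -> Int
    pow m n = loop 1 0
      where
        -- 'res' and 'i' play the role of the phi nodes %res and %i: their
        -- values depend on whether we arrive from LoopHeader (the initial
        -- arguments 1 and 0) or from a previous Loop iteration.
        loop res i =
          let res2 = res * m      -- %res2 = mul i32 %res, %M
              i2   = i + 1        -- %i2   = add i32 %i, 1
          in if i2 /= n           -- %cond = icmp ne i32 %i2, %N
               then loop res2 i2  -- br i1 %cond, label %Loop, ...
               else res2          -- ret i32 %res2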
All LLVM code is defined as part of an LLVM module, with modules serving as compilation units. An LLVM module consists of four parts: meta information, external declarations, global variables, and function definitions. Meta information can be used to define the endianness of the module, as well as the alignment and size of various LLVM types for the architecture the code will be compiled to. Global variables are as expected; they are prefixed with the @ symbol, as are functions, to indicate that they are actually pointers to the data and have global scope. This also distinguishes them from local variables, which are prefixed with the % symbol.
4.2 Comparing C-- and LLVM
As mentioned previously, GHC's Cmm language is based on the C-- language. The goals of the C-- project were not unlike those of the LLVM project. There is, however, an important difference between the two: LLVM is geared towards supporting aggressive optimisation of a universal language, and C-- towards supporting high-level language features, such as garbage collection, with no overhead. It is interesting, though, that despite these differences both projects independently developed very similar features. This might suggest that a universal low-level language needs to support a certain essential set of features. It is also interesting to see that over its lifetime, LLVM's design has in some areas moved towards that of C--. A few examples of features that C-- supported in its initial design and that LLVM only added later are:
• LLVM's type system originally was similar to C's, supporting signed and unsigned variations of char, byte, int and long. Now its type system is much closer to C--'s, having simply bitsN types and not distinguishing between signed and unsigned. LLVM also used to exclusively use overloaded operations, such as addition and division, but now increasingly has separate instructions for the different types.
• LLVM at first did not support declaring the calling convention of functions and calls; these declarations were only added later.
• LLVM originally contained malloc and free instructions, but these have very recently been removed.
• LLVM now has direct support for implementing garbage collection. This is not as complex or as versatile as the compile- and runtime interface supported by C--, but it works in a similar manner: frontend compilers annotate their code with safe points and call special functions in their LLVM code that trigger a compiler plug-in, which they need to supply, to add the information needed by their garbage collector to the code. Our backend doesn't use this support, though, as the garbage collector implemented by GHC doesn't require it.
5. LLVM backend design

As shown in Figure 1, our LLVM backend uses Cmm as its input language, just like the other two backends. In principle, we could have used the STG-language as our input language, to try to use its higher-level information to generate better code. However, that would have meant duplicating much of the existing functionality in the STG-to-Cmm phase, which not only deals with sophisticated language features, such as laziness and partial application, but also with runtime system considerations, such as the generation of metadata for the garbage collector. Instead of replicating this functionality, it seems more economical to fix any shortcomings in the Cmm code generator and in the Cmm language if and when we identify a situation where the LLVM backend doesn't have the information it needs. This hasn't happened so far. Moreover, sharing as much code as possible between the backends simplifies maintaining ABI compatibility between the new LLVM backend and the existing backends, which is important to enable linking to modules and libraries compiled with other backends. Finally, there is ongoing work in GHC to move to a new Cmm code generator and a slightly changed Cmm representation [25]. Once complete, this work should improve the code generated by all backends, making it complementary to the LLVM backend instead of competitive.

Despite all backends compiling off Cmm, the design and implementation of a translation phase to LLVM IR raises a number of conceptual problems that are unique to the LLVM backend: (1) the mapping of Cmm language constructs to LLVM IR, (2) the generation of LLVM's SSA form, and especially of the phi functions, (3) an efficient implementation of the STG registers, and (4) the implementation of Cmm's strict code and data layout constraints. We will address these issues individually in the following subsections.

5.1 Compiling Cmm to LLVM IR

The Cmm and LLVM IR were designed with a similar goal in mind: to be a minimal language that abstracts hardware. The primary difference is LLVM's broader focus, aiming to support multiple languages and aggressive optimisation of the code, whereas Cmm, as used in GHC, is skewed towards compiling Haskell-like languages. To support high-level data types, Cmm uses a label system that works like assembly labels. There is no type information, and arrays and record structures are implemented in the same manner. LLVM, in contrast, supports high-level data types such as arrays and structures explicitly. Nevertheless, translating between the two is fairly straightforward, especially since many of the harder cases, such as a Cmm data structure with labels at non-start positions, aren't used by GHC and so don't need to be supported; these features were inherited from C--.

Another minor difference is that LLVM's preferred way of accessing memory is a special instruction, getelementptr, that takes a pointer type, such as an array, and an index, returning a type-safe pointer. In contrast, Cmm uses explicit pointer arithmetic. LLVM supports this, too, but it prevents some worthwhile code optimisations. Initially we simply used pointer arithmetic in the LLVM backend, but we have recently changed to using the getelementptr instruction, primarily as part of some work to give LLVM better aliasing information.

Many other aspects of Cmm and LLVM IR are rather similar, and instead of discussing all features in detail, Table 1 provides a summary of the relationship. As a concrete example, consider the translation of the Cmm code of Figure 2 into unoptimised LLVM code in Figure 6. After improving the code with LLVM's optimiser, the code is more compact, as shown in Figure 7. By relying on the LLVM optimiser, instead of trying to generate better-optimised LLVM code directly, we could keep the LLVM backend simpler; after all, we want to offload as much work as possible onto LLVM.
Basic types.
  Cmm: a fixed set of integer and floating point types: i8, i16, i32, i64, i128; f32, f64, f80, f128.
  LLVM: support for any size-N bit type and a fixed set of floating point types: i1, i2, i3, ..., i32, ..., iN; float, double, x86-fp80, fp128.

High-level types.
  Cmm: supports a label type that represents the address of the location at which it is declared. This can be used to implement higher-level types such as arrays and structures. Array: cmmLabel {i32, i32, i32, i32}. Structure: cmmLabel {i32, f32, f64}.
  LLVM: has explicit support for high-level types such as arrays and structures. Array: [4 x i32]. Structure: {i32, float, double}.

Variables.
  Cmm: unlimited typed local variables. Global data is represented through untyped labels; all load and store operations are instead typed.
  LLVM: unlimited typed local and global variables; however, LLVM's use of SSA form means Cmm local variables don't map directly to LLVM local variables. Stack-allocated variables are used instead.

Code module structure.
  Cmm: a module consists of global variables and functions. Functions don't support arguments or return values; the STG registers are used for this purpose instead. Functions consist of a list of blocks. All control flow between blocks is explicit.
  LLVM: a module consists of global variables, functions, type aliases, metadata and a data layout specification. Functions support arguments and a single return value.

Expressions.
  Cmm: literals, memory reads, machine operations and STG register reads.
  LLVM: literals, memory reads and machine operations. LLVM has full coverage of the machine operations that Cmm supports, and not much more; the mapping is nearly one to one.

Statements.
  Cmm: comments, assignment, memory writes, unconditional branch, conditional branch, multi-way branch, tail calls and function calls. Cmm also supports calls to a group of functions called CallishMachOps. These are maths functions such as sin and log.
  LLVM: directly supports all of the statements that Cmm supports, and a few more. Interestingly, LLVM also supports many of the Cmm CallishMachOp instructions in a similar manner; in LLVM they are termed intrinsic functions. On hardware which supports them they should become CPU instructions; otherwise they are turned into calls to the C standard library.

Table 1. Mapping of Cmm and LLVM languages

    %fibmax_struct = type {i32}
    @fibmax = global %fibmax_struct {i32 35}

    define cc 10 void @fib( i32 %Base_Arg, i32 %Sp_Arg,
                            i32 %Hp_Arg, i32 %R1_Arg ) {
    cP:
        [...]
        %R1_Var = alloca i32, i32 1
        store i32 %R1_Arg, i32* %R1_Var
        %cf = alloca i32, i32 1
        %cl = alloca i32, i32 1
        %ck = alloca i32, i32 1
        %cj = alloca i32, i32 1
        %nQ = load i32* %R1_Var
        store i32 %nQ, i32* %cf
        store i32 0, i32* %cl
        store i32 1, i32* %ck
        store i32 0, i32* %cj
        %nR = load i32* %cf
        %nS = icmp eq i32 %nR, 1
        br i1 %nS, label %cT, label %nU
    nU:
        %nV = bitcast %fibmax_struct* @fibmax to i32*
        %nX = load i32* %nV
        %nY = load i32* %cf
        %nZ = icmp ult i32 %nX, %nY
        br i1 %nZ, label %cT, label %n10
    n10:
        [...]

Figure 6. Partial output of the LLVM backend with Figure 2 as input

    %fibmax_struct = type { i32 }
    @fibmax = global %fibmax_struct { i32 35 }

    define cc 10 void @fib( i32 %Base_Arg, i32 %Sp_Arg,
                            i32 %Hp_Arg, i32 %R1_Arg ) {
    cP:
        %nS = icmp eq i32 %R1_Arg, 1
        br i1 %nS, label %c12, label %nU
    nU:
        %nX = load i32* getelementptr inbounds
                  (%fibmax_struct* @fibmax, i32 0, i32 0)
        %nZ = icmp ult i32 %nX, %R1_Arg
        br i1 %nZ, label %c12, label %c13.preheader
    c13.preheader:
        [...]
    }

Figure 7. Output of the LLVM optimiser with Figure 6 as input

5.2 Dealing with LLVM SSA form

As we discussed in Section 4, LLVM code must be in SSA form, i.e., all LLVM virtual registers are immutable, single-assignment variables. In contrast, all of Cmm's variables are mutable, so we need to handle the conversion to SSA form as part of the LLVM backend. The conversion of arbitrary code into SSA form is a well-understood problem; however, it requires a fair amount of implementation work. Thankfully, LLVM provides us with an alternative option that is far simpler: instead of LLVM's virtual registers, we can use stack-allocated variables. We translate each mutable Cmm variable into an LLVM variable allocated on the stack using the alloca instruction. This instruction returns a pointer into the stack that can be read from and written to just like any other memory location in LLVM, using explicit load and store instructions. Code using stack-allocated variables instead of virtual registers is generally slower, but LLVM includes an optimisation pass called mem2reg, which is designed to correct this. This pass turns explicit stack allocation into the use of virtual registers, in a manner that is compatible with the SSA restriction, by using phi functions. In effect, mem2reg implements the SSA conversion for us.
5.3 Handling the STG registers

The efficient treatment of the STG registers was one of the major challenges we faced in writing the LLVM backend. While the LLVM backend could easily implement the STG registers using unregisterised mode, where they are all stored in memory, this would lead to poor performance. Hence, we need to support registerised mode and map as many of the STG registers as possible to hardware registers, and we want to do that such that it yields the same register mapping as used by the other two backends. This is crucial for ABI compatibility, as discussed.

Register mapping is a straightforward affair for the NCG, as it has full control over register allocation. GHC's NCG register allocator is aware of the special status of the STG registers and simply reserves the appropriate hardware registers for exclusive use as STG registers; i.e., it aliases the STG registers with some hardware registers. The situation is similar for the C backend. Although ANSI C does not offer control over hardware-specific registers, GNU C provides an extension, called global register variables, which facilitates the same approach of reserving fixed hardware registers for specific STG registers throughout the code.

LLVM does not provide this option. Instead, our solution is to implement a new calling convention for LLVM that passes the first n arguments of a function call in specific hardware registers: the hardware registers that we would like to associate with STG registers. The LLVM backend then compiles each Cmm function such that the corresponding LLVM function uses the new calling convention with the appropriate number of parameters. Furthermore, the generated LLVM code passes the correct STG registers as the first n arguments to each call. As a consequence, the values of the STG registers are in the appropriate hardware registers on entry to any function. This is in contrast to the other two backends, where the STG registers are also pinned to their hardware registers throughout the body of a function. However, to guarantee the correct register assignment at function entry, it is sufficient to ensure that the runtime system finds the registers in the correct place and that LLVM code can call and be called from code generated by other backends. In fact, it is an improvement over the strategy of the other two backends, as the n hardware registers can temporarily be used for other purposes in function bodies if LLVM's register allocator decides it is worthwhile spilling them. In most cases, though, simply leaving the STG registers in the hardware registers is the best allocation, and LLVM is capable of recognising this.

The only downside of a new calling convention is that its addition requires modifying the LLVM source code. However, our extension has recently been accepted upstream by the LLVM developers. It is included in public versions of LLVM since version 2.7, which was released in May of 2010; i.e., as long as version 2.7 or later is being used, GHC works with a standard LLVM installation.

5.4 Handling Cmm data and code layout

As discussed in Section 3.5, Cmm's requirement to preserve the ordering of data and code layout is uncommon and causes problems for backends other than the NCG (which generates the assembly sections explicitly). Unfortunately, GHC uses this property of Cmm to implement the tables-next-to-code optimisation, which is fairly significant, giving a roughly 5% reduction in runtimes. As the C backend can't meet the layout requirement with either ANSI C or one of the GNU C extensions, it resorts to an extra pass over the assembly code produced by the GNU C compiler to rearrange assembly sections and rewrite the code. The LLVM backend faces the same problem as the C backend, as there is no explicit support for ordering code sections in LLVM.

Fortunately, we found a technique to realise the ordering constraint by using the subsections feature of the GNU Assembler (gas). This feature facilitates the specification of a numbered subsection whenever assembly is placed into a particular assembly section. When gas compiles the assembly to object code, it combines subsections in ascending numerical order and creates only one section including all the numbered subsections. In other words, the subsections are purely a structure that exists in the assembly; they do not appear explicitly in the object code. To guarantee that a closure's metadata appears immediately before the closure's entry code, we simply place the metadata in text subsection ⟨n⟩ and the entry code in text subsection ⟨n + 1⟩, making sure that no other code or functions use those subsections. This is illustrated by the example in Figure 8.

    .text 12
    sJ8_info:
        movl $base_SystemziIO_hPrint2_closure, (%ebp)
        movl $base_GHCHandleziFD_stdout_closure, -4(%ebp)
        addl $-4, %ebp
        jmp base_GHCziIOziHandleziText_hPutChar1_info

    .text 11
    sJ8_info_itable:
        .long Main_main1_srt-sJ8_info
        .long 0
        .long 327712

Figure 8. GNU Assembler subsections used to implement TNTC

While this approach works well, it does create a portability problem, as it only works with the GNU Assembler. Fortunately, this covers two of the three major platforms GHC is supported on: Linux and Windows. On Mac OS X, though, we are unable to use this technique, and so for now we have resorted to post-processing the assembly produced by LLVM in a manner similar to the C backend. While this is regrettable, it is important to note that the mangler² used by the LLVM backend consists of only about 180 lines of Haskell code, of which half is documentation. The C mangler, by comparison, is around 2,000 lines of Perl code, as it has to handle multiple platforms and far more than simple assembly code rearrangement. For the future, we are planning to move to a purely LLVM-based solution by extending LLVM to explicitly support associating a global variable with a function. This approach might enable better code optimisation, for example, by performing global constant propagation with the info-table values.

² We are tentatively calling this pass the Righteous Mangler, in line with the established naming convention.

6. Evaluation of the LLVM backend

Next, we evaluate the new LLVM backend in comparison to GHC's existing C backend and NCG. The evaluation is in two parts: first, we consider the complexity of the backends themselves, and second, we analyse the performance of the generated code. The backend complexity is a primary concern for GHC developers, whereas code performance concerns both developers and users of GHC.
6.1 Complexity of backend implementations

As a simple metric for the code complexity of the three backends, we compare their respective code sizes. This gives us an indication of the amount of work initially required to implement them, as well as the effort spent on maintenance. Table 2 displays the code size of the various components of the three backends. The LLVM backend is the smallest, at 3,133 lines of code. The C backend is over 70% larger, at 5,382 lines. The NCG is by far the largest, being 6 times larger than the LLVM backend and 4 times larger than the C backend; it totals 20,570 lines.

In addition to plain code size, we also need to consider the structural and conceptual complexity, particularly for the C backend, which doesn't seem an unreasonable size. The C backend consists of three distinct components: (1) the actual compiler that maps Cmm to C code; (2) the C header files included in the generated C code; and (3) the evil mangler, a Perl script post-processing the generated assembly. The C headers define a large number of macros and data structures that decrease the work required by the code generator and also deal with platform-specific issues, such as word size. The C headers are fairly sophisticated, but arguably the most complex part of the C backend is the evil mangler, which implements the TNTC optimisation as well as a variety of other optimisations, such as removing each C function's prologue and epilogue. The fragile Perl code uses a large number of regular expressions that need to be updated regularly for new versions of GCC.

The NCG consists of a shared component plus a platform-specific component for each supported architecture. The design is fairly typical of an NCG. The shared component consists of a general framework for abstracting and driving the pipeline, as well as a register allocator. Each platform-specific component is responsible for the rest of the pipeline, principally instruction selection and pretty printing of the assembler code.

The LLVM backend comprises two components: (1) the compiler itself and (2) a code module for interfacing with LLVM. It has none of the complexity of the C backend, with its sophisticated assembly post-processing, or of the NCG, with its size and architecture-specific code. It is also nearly platform independent: it needs to know only the word size, endianness and the mapping of STG registers to hardware registers, and all of this information is already used elsewhere in GHC, so it isn't specific to the LLVM backend.
116
NCG and C backend against LLVM backend (∆%)
Lines of code of the GHC backends C
NCG
LLVM
Total Compiler Includes Assembly Processor Total Shared X86 SPARC PowerPC Total Compiler LLVM Module
5382 1122 2201 2059 20570 7777 5208 4243 3342 3133 1865 1268
Program atom comp lab zift cryptarithm1 hidden integer integrate simple transform treejoin wave4main wheel-sieve2 (79 more) -1 s.d. +1 s.d. Average
Table 2. GHC backend code sizes sists of three distinct components: (1) the actual compiler that maps Cmm to C code; (2) the C header files included in the generated C code; and (3) the evil mangler, a Perl script post-processing the generated assembly. The C headers define a large number of macros and data structures that decrease the work required by the code generator and also deal with platform specific issues, such as word size. The C headers are fairly sophisticated, but the arguably most complex part of the C backend is the evil mangler, which is implements the TNTC optimisation as well as a variety of other optimisations, such as removing each C function prologues and epilogues. The fragile Perl code uses a large number of regular expressions that need to be updated regularly for new versions of GCC. The NCG consists of a shared component plus a platformspecific component for each supported architecture. The design is fairly typical of a NCG. The shared component consists a general framework for abstracting and driving the pipeline as well as a register allocator. Each platform specific component is responsible for the rest of the pipeline, principally consisting of instruction selection and pretty printing of the assembler code. The LLVM backend comprises two components: (1) the compiler itself and (2) a code module for interfacing with LLVM. It has none of the complexity of the C backend with its sophisticated assembly post-processing or of the NCG with its size and architecture-specific code. It’s also nearly platform independent; it needs to know the word size, endianness and the mapping of STG registers to hardware registers. All of this information is already used elsewhere in GHC and so isn’t specific to the LLVM backend. 6.2
NCG Runtime -6.2 -4.6 -3.4 4.7 -1.0 2.8 1.5 4.0 -2.8 6.8 -3.4 .. -3.1 3.0 -0.1
C Runtime -0.8 -0.9 0.9 10.9 -1.3 8.3 16.6 5.7 4.6 12.4 -2.8 .. -2.0 7.6 2.7
Table 3. NoFib runtimes of all three backends Main_runExperiment_entry() c1GU: Hp = Hp + 36; if (Hp > HpLim) goto c1GX; I32[Hp - 32] = s1Gh_info; I32[Hp - 24] = I32[Sp + 12]; I32[Hp - 20] = I32[Sp + 0]; I32[Hp - 16] = I32[Sp + 4]; I32[Hp - 12] = I32[Sp + 8]; I32[Hp - 8] = ghczmprim_GHCziTypes_ZC_con_info; I32[Hp - 4] = I32[Sp + 12]; I32[Hp + 0] = Hp - 32; R1 = Hp - 6; Sp = Sp + 16; jump (I32[Sp + 0]) (); c1GY: R1 = Main_runExperiment_closure; jump stg_gc_fun (); c1GX: HpAlloc = 36; goto c1GY; Figure 9. A typical Cmm function produced by GHC
Performance
To compare the quality of the generated code, we consider runtimes, but also other metrics, such as compilation times and the size of the compiled code. We used a Core 2 Duo 2.2GHz machine running a 32-bit Linux OS and set the runtimes of the LLVM backend as the baseline, except where otherwise specified. So positive percentages for either the C backend or the NCG mean that they are slower than the LLVM backend by that percentage, and negative percentages mean they are faster by that percentage. First, we investigate the NoFib benchmark suite and then some interesting individual examples. Also, as the NCG is GHC's default backend and as the GHC developers are looking at deprecating the C backend, we focus on comparing against the NCG.

NoFib [22] is the standard benchmark suite for GHC. It is developed alongside GHC and used by developers to test the performance impact of changes to GHC. In Table 3, we see the runtimes of the NCG and C backend against the LLVM backend. There is little difference between the three backends. The NCG comes out with the best overall runtime, ahead of the LLVM backend by 0.1%. The C backend comes in last, trailing 2.7% behind the LLVM backend. The table includes only individual benchmarks whose runtimes vary significantly between backends. We investigated each of these benchmarks individually to determine the cause of the difference. We didn't find any cases where the C backend had a conceptual advantage. Where the C backend performed poorly, this was a combination of the at times awkward mapping of Haskell to C and performance bugs in GCC. Comparing the NCG against the LLVM backend, further testing showed that the performance difference was greatest for atom, hidden and wave4main. However, in all three cases, no particular feature of the code generation seemed to be responsible; it was just a matter of better default instruction selection.

All this raises an important question: why do we get such similar results with very different code generators? Especially if we consider that GHC's NCG optimisation pass consists of just branch-chain elimination and constant folding, whereas LLVM implements around 60 different optimisation passes. We conjecture that the reason for the similar performance is the Cmm code they all use as input: it just isn't easily optimised. Much of the Cmm code that GHC produces is essentially memory bound, a side effect of Haskell being a lazily evaluated language, and so there is often very little register pressure or choice in instruction selection, which is why the NCG is able to perform close to LLVM. Figure 9 contains some typical Cmm code, illustrating the problem. To further test our conjecture, we ran the NoFib benchmarks with the optimisation level of LLVM set to the various supported default groups: -O0, -O1, -O2 and -O3. The results are in Table 4.

NoFib, however, doesn't tell us the full story concerning performance, specifically due to the idiomatic Cmm of GHC. This becomes apparent, for example, in code using stream fusion [8] and highly optimised array code, such as that of the parallel array library Repa [17]. The Cmm code of the compute-intensive, tight inner loops of these libraries generally suffers from high register pressure and can benefit from smart instruction ordering, which leaves considerable scope for LLVM to optimise performance. The considerable impact that LLVM's optimisations can have on such code is quantified in Table 5, where we compare the single-threaded and multi-threaded performance of NCG- and LLVM-generated code for three Repa benchmarks (see [17] for details on these benchmarks).

We further investigated two simple benchmarks featuring tight loops: (1) zip3, Figure 10, uses the high-performance vector library, based on array fusion; and (2) hailstone, Figure 11, relies on list fusion from the standard Prelude and unboxed integers. We evaluated both benchmarks using the Criterion benchmarking library [21], looking at the resulting kernel density estimates. Figure 12 shows the kernel density estimates for both benchmarks using the three backends. For zip3, the LLVM backend comes out clearly in front with a mean runtime of 334ms, the C backend second with 423ms and the NCG last with a mean of 590ms. The generated Cmm code consists of 3 functions that produce the three enumerated lists. Each calls a common comparator function that checks whether the end of the list has been reached. The LLVM backend aggressively inlines the comparator function, saving a jump instruction for each of the three list enumerations. The C backend generates remarkably similar code to the NCG; the difference simply seems to be in the ordering of some branches and basic blocks, with the C backend choosing the correct hot path. For hailstone, we see that the C backend comes out in front with a mean of 567ms, the LLVM backend second with a mean of 637ms and the NCG last with a disappointing mean of 2.268s. The LLVM and C backends perform well for two reasons: (1) they both perform significantly better instruction selection and (2) they both inline a large amount of code. The C backend outperforms the LLVM backend due to slightly better branch ordering.

Table 6 lists the summary of the compile times and object file sizes for the NoFib suite. Both the NCG and C backend produce smaller code. This is currently a deficiency of LLVM: it does not yet optimise for code size. For compile times, the LLVM backend sits between the NCG and C backend. This is due to LLVM's additional optimisation passes, which incur an overhead compared to the NCG. We saw the considerable benefit of these optimisations in the runtimes of the Repa, zip3, and hailstone benchmarks.
         -1 s.d.   +1 s.d.   Average
  -O1      -4.9      3.0       -1.0
  -O2      -6.7      2.1       -2.4
  -O3      -6.3      4.0       -1.3

Table 4. NoFib runtimes of LLVM at different optimisation levels: LLVM optimiser against -O0 (∆%)

  Metric               NCG      C Backend
  Object File Sizes   -12.8        -5.2
  Compilation Times   -64.8       +35.6

Table 6. NoFib: Object file sizes and compile times
  Benchmark   #cores    NCG      LLVM     Speed up
  mmult       1         13.38s   4.64s    2.88
  mmult       8         1.68s    0.62s    2.71
  laplace     1         4.75s    2.98s    1.59
  laplace     8         1.44s    1.15s    1.25
  fft         1         8.88s    8.75s    1.01
  fft         7         2.06s    2.02s    1.02

Table 5. LLVM versus NCG performance for Repa benchmarks

import qualified Data.Vector.Unboxed as U

main = print $ U.sum $ U.zipWith3 (\x y z -> x * y * z)
    (U.enumFromTo 1 (100000000::Int))
    (U.enumFromTo 2 (100000001::Int))
    (U.enumFromTo 7 (100000008::Int))

Figure 10. Vector Zip3 benchmark

collatzLen :: Int -> Word32 -> Int
collatzLen c 1 = c
collatzLen c n = collatzLen (c+1) $ if n `mod` 2 == 0 then n `div` 2 else 3*n+1

pmax x n = x `max` (collatzLen 1 n, n)

main = print $ foldl pmax (1,1) [2..1000000]

Figure 11. Hailstone benchmark
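As an aside on methodology, the following is a sketch of ours (not the authors' harness) showing how the hailstone kernel could be timed with the Criterion library [21]; it reuses pmax from Figure 11.

import Criterion.Main

-- Time the hailstone fold from Figure 11; Criterion runs it repeatedly
-- and reports a kernel density estimate of the execution times.
main :: IO ()
main = defaultMain
  [ bench "hailstone" $ whnf (foldl pmax (1, 1)) [2..1000000] ]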
6.3  LLVM's type system
An advantage of LLVM is its fairly strong type system and static checking of compiled code. While it doesn't approach the level of sophistication that Haskell programmers are used to, it does offer a system similar to C's. All variables and memory locations are typed, all operations obey strict type rules, and pointers are carefully distinguished from other types. For example, to perform pointer arithmetic, a pointer must first be cast to an integer of word width, and the result cast back to a pointer afterwards. We found this type system very helpful while implementing the LLVM backend, especially as there is usually little compiler support for such a low-level task as code generation. To quantify the benefit of LLVM's checks, we scanned the source code revision history for the backend, looking at the bugs we still had to fix after the backend was able to compile a whole Haskell program. There were 15 fixes in total, after which the backend was capable of compiling GHC itself. Of these 15, 10 fixes were motivated by compile-time errors generated by LLVM. Some of these were obvious bugs that would also have been discovered by a traditional assembler, but a few were more subtle. They generally related to pointer handling, such as one bug where we returned the pointer itself instead of the value it pointed to. Of the 5 bugs that LLVM didn't pick up, two were related to generating incorrect function and data labels, one was an incorrectly compiled negate operation, one an incorrectly compiled label offset operation, and one was due to a bug in the LLVM optimiser.
[Figure 12 (plots omitted): kernel density estimates of execution times, one panel per benchmark and backend: "zipWith3 (llvm)", "zipWith3 (asm)", "zipWith3 (c)", "hailstone (llvm)", "hailstone (asm)" and "hailstone (c)". Each panel plots execution time (x-axis) against an estimate of probability density (y-axis).]
Figure 12. Runtimes of Hailstone and Zip3 benchmarks
The combination of LLVM's type system with the SSA form, which requires every operation to be explicit, is a significant help for compiler development. It is also rewarding to see type systems being used at such a low level: although the type system was originally designed to enable more aggressive optimisations, it also increases safety.
7.  Related work
There is a large amount of work in the compiler community around LLVM, including static code generation, just-in-time code generators, and static analysis tools. Van Schie recently added an LLVM backend to the Utrecht Haskell Compiler (UHC, previously known as the Essential Haskell Compiler, EHC), a research Haskell compiler designed around the idea of implementing the compiler as a series of compilers [11, 29]. UHC's usual backend is a C code generator. His work produced some impressive results, which included a 10% reduction in the runtime of compiled programs. The UHC LLVM backend, however, didn't reach the stage of being able to handle the full Haskell language, instead working with a subset of Haskell. The results with the UHC backend can hardly be compared to our work, as GHC by default already generates code that is on average 40 times faster than that of UHC. Other projects using LLVM include the following:
• Clang: A C, C++ and Objective-C compiler using LLVM as a backend target [1].
• LDC: A compiler for the D programming language using LLVM as a backend target [2].
• llvm-lua: A compiler for the Lua programming language using LLVM as a backend target [3].
• SAFECode: A compiler that takes LLVM bitcode as input and uses static analysis to produce a memory-safe version of the same program [10].
• OpenJDK Project Zero: A version of Sun Microsystems' open-source JVM, OpenJDK, which uses zero assembly. LLVM is used as a replacement for the usual just-in-time compiler [4].
• Pure: A functional programming language based on term rewriting. Pure uses LLVM as a just-in-time compiler [13].
• Unladen Swallow: A Google-backed Python virtual machine with an LLVM-based just-in-time compiler [5].

Using LLVM isn't the only approach, though, to providing a relatively portable, easy-to-target, high-performance compiler target; there are also high-level virtual machines, such as Microsoft's Common Language Runtime (CLR) [16] or the Java Virtual Machine (JVM) [20]. These are in some ways quite similar to LLVM: they all provide a virtual instruction set that abstracts away the underlying hardware and can be targeted by multiple programming languages. The functional programming language Clojure [15], for example, targets the JVM with great success. Targeting the JVM or CLR also has the added benefit of direct access to the high-quality libraries that come with both platforms. There are trade-offs, though: using a high-level virtual machine means that many of your choices are made for you. Features such as garbage collection and exception handling are provided, and as such you need to be able to efficiently map your language onto the design of these services, which may not always be possible, or at least efficient. Neither the CLR nor the JVM provides an option of compiling your code to native machine code; both use interpreters and just-in-time compilation for execution, which generally leads to lower performance. LLVM doesn't include these high-level services and enables us to use infrastructure optimised for Haskell. It also permits us to choose between static compilation or interpretation with just-in-time compilation.

As well as the work around LLVM, there is also work being done in GHC around code generation. Ramsey et al. are redesigning the architecture of GHC's backend pipeline [25] to improve the code generated by GHC. A large part of this work is the design of a dataflow optimisation framework called Hoopl that can be used to easily compose together distinct passes. While there is some
overlap in the optimisations that can be done with Hoopl and those implemented by LLVM, this work is mostly complementary to the LLVM backend as it is intended to replace the current STG to Cmm code generator with a much more modular design, not just duplicate optimisations present in LLVM. The end result of the work will be more efficient Cmm code passed to the LLVM code generator.
8.  Conclusion
Our LLVM backend is clearly simpler, conceptually and in terms of lines of code, than the two previous backends. It effectively outsources a sophisticated part of GHC's compilation pipeline and frees developer resources to concentrate on issues that are more directly relevant to the Haskell community. Our quantitative analysis shows that the LLVM backend, already in its current form, generates code that is on par with GHC's native code generator (the more efficient of the two current backends). For tight loops, as generated by the vector package, we even see a clear performance advantage for our backend. The biggest disadvantage of the LLVM backend is currently its comparatively high compilation time with respect to the native code generator. This is partly to be expected, as the LLVM backend performs a lot more optimisation work on the code. We do expect to be able to improve compilation speed, though: currently the LLVM backend interfaces with LLVM via intermediate files and the LLVM command-line tools, which wastes time parsing and pretty printing LLVM code. LLVM can also be used as a shared library, using entirely in-memory representations of the LLVM IR. By using this facility, we should be able to improve compilation speeds significantly.
Acknowledgements. We are grateful to Chris Lattner, Eugene Todler, Ben Lippmeier and Simon Marlow for technical advice during the development of the LLVM backend. We furthermore thank Ben Lippmeier for providing the Repa benchmark results. We thank Gabriele Keller for improving a draft of this paper. The first author thanks Microsoft Research, Cambridge, for hosting him as an intern while writing this paper and further improving GHC's LLVM backend.
References

[1] clang: A C language family frontend for LLVM. http://clang.llvm.org/, 2010.
[2] LDC: LLVM D Compiler. http://www.dsource.org/projects/ldc, 2010.
[3] llvm-lua: JIT/static compiler for Lua using LLVM on the backend. http://code.google.com/p/llvm-lua/, 2010.
[4] OpenJDK: Zero-Assembler project. http://openjdk.java.net/projects/zero/, 2010.
[5] Unladen Swallow: A faster implementation of Python. http://code.google.com/p/unladen-swallow/, 2010.
[6] A. W. Appel. SSA is functional programming. ACM SIGPLAN Notices, 33(4):17–20, 1998.
[7] P. Codognet and D. Diaz. wamcc: compiling Prolog to C. In Proceedings of the Twelfth International Conference on Logic Programming, pages 317–331, 1995.
[8] D. Coutts, R. Leshchinskiy, and D. Stewart. Stream fusion: From lists to streams to nothing at all. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, ICFP 2007, Apr. 2007.
[9] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):451–490, October 1991.
[10] D. Dhurjati, S. Kowshik, and V. Adve. SAFECode: enforcing alias analysis for weakly typed languages. In PLDI '06: Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 144–157, New York, NY, USA, 2006. ACM. ISBN 1-59593-320-4.
[11] A. Dijkstra, J. Fokker, and S. D. Swierstra. The structure of the Essential Haskell Compiler, or coping with compiler complexity. In IFL '07: Proceedings of the 19th International Symposium on Implementation and Application of Functional Languages, pages 107–122, 2007.
[12] C. Flanagan, A. Sabry, B. F. Duba, and M. Felleisen. The essence of compiling with continuations. In Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation, PLDI '93, volume 28(6), pages 237–247. ACM Press, 1993.
[13] A. Gräf. The Pure programming language. http://code.google.com/p/pure-lang/, 2009.
[14] F. Henderson, Z. Somogyi, and T. Conway. Compiling logic programs to C using GNU C as a portable assembler. In Proceedings of the ILPS '95 Postconference Workshop on Sequential Implementation Technologies for Logic Programming Languages, 1995.
[15] R. Hickey. The Clojure programming language. In DLS '08: Proceedings of the 2008 Symposium on Dynamic Languages, pages 1–1, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-270-2.
[16] ECMA International. Standard ECMA-335: Common Language Infrastructure (CLI), 4th edition. Technical report, ECMA International, June 2006.
[17] G. Keller, M. M. T. Chakravarty, R. Leshchinskiy, S. Peyton Jones, and B. Lippmeier. Regular, shape-polymorphic, parallel arrays in Haskell. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, ICFP 2010, Sept. 2010.
[18] C. Lattner. LLVM: An infrastructure for multi-stage optimization. Master's thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, December 2002.
[19] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO '04: Proceedings of the International Symposium on Code Generation and Optimization, page 75. IEEE Computer Society, 2004. ISBN 0-7695-2102-9.
[20] T. Lindholm and F. Yellin. Java Virtual Machine Specification. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, April 1999. ISBN 0201432943.
[21] B. O'Sullivan. criterion: Robust, reliable performance measurement and analysis. http://hackage.haskell.org/package/criterion, 2010.
[22] W. Partain. The NoFib benchmark suite of Haskell programs. Pages 195–202, London, UK, 1993. Springer-Verlag.
[23] S. L. Peyton Jones. Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine. Journal of Functional Programming, 2(2), 1992.
[24] S. L. Peyton Jones, N. Ramsey, and F. Reig. C--: A portable assembly language that supports garbage collection. In PPDP '99: Proceedings of the International Conference on Principles and Practice of Declarative Programming, pages 1–28. Springer-Verlag, 1999. ISBN 3-540-66540-4.
[25] N. Ramsey, J. Dias, and S. Peyton Jones. Hoopl: Dataflow optimization made simple. In ACM SIGPLAN Haskell Symposium 2010. ACM Press, 2010.
[26] M. Sulzmann, M. Chakravarty, S. Peyton Jones, and K. Donnelly. System F with type equality coercions. In ACM SIGPLAN International Workshop on Types in Language Design and Implementation (TLDI '07). ACM, 2007.
[27] D. Tarditi, P. Lee, and A. Acharya. No assembly required: compiling Standard ML to C. ACM Letters on Programming Languages and Systems, 1(2):161–177, 1992. ISSN 1057-4514.
[28] The LLVM Team. The LLVM compiler infrastructure: LLVM users. http://llvm.org/Users.html.
[29] J. van Schie. Compiling Haskell to LLVM. Master's thesis, Department of Information and Computing Sciences, Utrecht University, 2008.
Hoopl: A Modular, Reusable Library for Dataflow Analysis and Transformation

Norman Ramsey         Tufts University       [email protected]
João Dias             Tufts University       [email protected]
Simon Peyton Jones    Microsoft Research     [email protected]
Abstract

Dataflow analysis and transformation of control-flow graphs is pervasive in optimizing compilers, but it is typically entangled with the details of a particular compiler. We describe Hoopl, a reusable library that makes it unusually easy to define new analyses and transformations for any compiler written in Haskell. Hoopl's interface is modular and polymorphic, and it offers unusually strong static guarantees. The implementation encapsulates state-of-the-art algorithms (interleaved analysis and rewriting, dynamic error isolation), and it cleanly separates their tricky elements so that they can be understood independently.

Readers: Code examples are indexed at http://bit.ly/cZ7ts1.

Categories and Subject Descriptors D.3.4 [Processors]: Optimization, Compilers; D.3.2 [Language Classifications]: Applicative (functional) languages, Haskell

General Terms Algorithms, Design, Languages

1.  Introduction

A mature optimizing compiler for an imperative language includes many analyses, the results of which justify the optimizer's code-improving transformations. Many analyses and transformations—constant propagation, live-variable analysis, inlining, sinking of loads, and so on—should be regarded as particular cases of a single general problem: dataflow analysis and optimization. Dataflow analysis is over thirty years old, but a recent, seminal paper by Lerner, Grove, and Chambers (2002) goes further, describing a powerful but subtle way to interleave analysis and transformation so that each piggybacks on the other.

Because optimizations based on dataflow analysis share a common intellectual framework, and because that framework is subtle, it is tempting to try to build a single, reusable library that embodies the subtle ideas, while making it easy for clients to instantiate the library for different situations. Although such libraries exist, as we discuss in Section 6, they have complex APIs and implementations, and none interleaves analysis with transformation.

In this paper we present Hoopl (short for "higher-order optimization library"), a new Haskell library for dataflow analysis and optimization. It has the following distinctive characteristics:

• Hoopl is purely functional. Although pure functional languages are not obviously suited to writing standard algorithms that transform control-flow graphs, pure functional code is actually easier to write, and far easier to write correctly, than code that is mostly functional but uses a mutable representation of graphs (Ramsey and Dias 2005). When analysis and transformation are interleaved, so that graphs must be transformed speculatively, without knowing whether a transformed graph will be retained or discarded, pure functional code offers even more benefits.

• Hoopl is polymorphic. Just as a list library is polymorphic in the list elements, so is Hoopl polymorphic, both in the nodes that inhabit graphs and in the dataflow facts that analyses compute over these graphs (Section 4).

• The paper by Lerner, Grove, and Chambers is inspiring but abstract. We articulate their ideas in a concrete, simple API, which hides a subtle implementation (Sections 3 and 4). You provide a representation for facts, a transfer function that transforms facts across nodes, and a rewrite function that can use a fact to justify rewriting a node. Hoopl "lifts" these node-level functions to work over control-flow graphs, solves recursion equations, and interleaves rewriting with analysis. Designing APIs is surprisingly hard; after a dozen significantly different iterations, we offer our API as a contribution.

• Because clients can perform very local reasoning ("y is live before x:=y+2"), analyses and transformations built on Hoopl are small, simple, and easy to get right. Moreover, Hoopl helps you write correct optimizations: statically, it rules out transformations that violate invariants of the control-flow graph (Sections 3 and 4.3), and dynamically, it can help find the first transformation that introduces a fault in a test program (Section 5.5).

• Hoopl implements subtle algorithms, including (a) interleaved analysis and rewriting, (b) speculative rewriting, (c) computing fixed points, and (d) dynamic fault isolation. Previous implementations of these algorithms—including three of our own—are complicated and hard to understand, because the tricky pieces are implemented all together, inseparably. In this paper, each tricky piece is handled in just one place, separate from the others (Section 5). We emphasize this implementation as an object of interest in its own right.

Our work bridges the gap between abstract, theoretical presentations and actual compilers. Hoopl is available from http://ghc.cs.tufts.edu/hoopl and also from Hackage (version 3.8.6.0). One of Hoopl's clients is the Glasgow Haskell Compiler, which uses Hoopl to optimize imperative code in GHC's back end.
Hoopl’s API is made possible by sophisticated aspects of Haskell’s type system, such as higher-rank polymorphism, GADTs, and type functions. Hoopl may therefore also serve as a case study in the utility of these features.
2.  Dataflow analysis & transformation by example

A control-flow graph, perhaps representing the body of a procedure, is a collection of basic blocks—or just "blocks." Each block is a sequence of instructions, beginning with a label and ending with a control-transfer instruction that branches to other blocks. The goal of dataflow optimization is to compute valid dataflow facts, then use those facts to justify code-improving transformations (or rewrites) on a control-flow graph.

As a concrete example, we show constant propagation with constant folding. On the left we show a basic block; in the middle we show facts that hold between statements (or nodes) in the block; and on the right we show the result of transforming the block based on the facts:

Before                     Facts                After
                  ------------{}------------
x := 3+4                                       x := 7
                  ----------{x=7}-----------
z := x>5                                       z := True
                  -------{x=7, z=True}------
if z then goto L1                              goto L1
     else goto L2

Constant propagation works from top to bottom. In this example, we start with the empty fact. Given that fact and the node x:=3+4, can we make a transformation? Yes: constant folding can replace the node with x:=7. Now, given this transformed node and the original fact, what fact flows out of the bottom of the transformed node? The fact {x=7}. Given the fact {x=7} and the node z:=x>5, can we make a transformation? Yes: constant propagation can replace the node with z:=7>5. Now, can we make another transformation? Yes: constant folding can replace the node with z:=True. The process continues to the end of the block, where we can replace the conditional branch with an unconditional one, goto L1.

The example above is simple because it has only straight-line code; control flow makes dataflow analysis more complicated. For example, consider a graph with a conditional statement, starting at L1:

L1: x=3; y=4; if z then goto L2 else goto L3
L2: x=7; goto L3
L3: ...

Because control flows to L3 from two places (L1 and L2), we must join the facts coming from those two places. All paths to L3 produce the fact y=4, so we can conclude that y=4 at L3. But depending on the path to L3, x may have different values, so we conclude "x=⊤", meaning that there is no single value held by x at L3. The final result of joining the dataflow facts that flow to L3 is the fact x=⊤ ∧ y=4 ∧ z=⊤.

Forward and backward. Constant propagation works forward, and a fact is often an assertion about the program state, such as "variable x holds value 7." Some useful analyses work backward. A prime example is live-variable analysis, where a fact takes the form "variable x is live" and is an assertion about the continuation of a program point. For example, the fact "x is live" at a program point P is an assertion that x is used on some program path starting at P. The accompanying transformation is called dead-code elimination; if x is not live, this transformation replaces the node x:=e with a no-op.

Interleaved analysis and transformation. Our first example interleaves analysis and transformation. Interleaving makes it easy to write effective analyses. If instead we had to finish analyzing the block before transforming it, analyses would have to "predict" the results of transformations. For example, given the incoming fact {x=7} and the instruction z:=x>5, a pure analysis could produce the outgoing fact {x=7, z=True} by simplifying x>5 to True. But the subsequent transformation must perform exactly the same simplification when it transforms the instruction to z:=True! If instead we first rewrite the node to z:=True, then apply the transfer function to the new node, the transfer function becomes wonderfully simple: it merely has to see if the right hand side is a constant. You can see code in Section 4.6.

Another example is the interleaving of liveness analysis and dead-code elimination. As mentioned in Section 1, it is sufficient for the analysis to say "y is live before x:=y+2". It is not necessary to have the more complex rule "if x is live after x:=y+2 then y is live before it," because if x is not live after x:=y+2, the assignment x:=y+2 will be transformed away (eliminated). When several analyses and transformations can interact, interleaving them offers even more compelling benefits; for more substantial examples, consult Lerner, Grove, and Chambers (2002).

But the benefits come at a cost. To compute valid facts for a program that has loops, an analysis may require multiple iterations. Before the final iteration, the analysis may compute a fact that is invalid, and a transformation may use the invalid fact to rewrite the program (Section 4.7). To avoid unjustified rewrites, any rewrite based on an invalid fact must be rolled back; transformations must be speculative. As described in Section 4.7, Hoopl manages speculation with minimal cooperation from the client.

While it is wonderful that we can create complex optimizations by interleaving very simple analyses and transformations, it is not so wonderful that very simple analyses and transformations, when interleaved, can exhibit complex emergent behavior. Because such behavior is not easily predicted, it is essential to have good tools for debugging. Hoopl's primary debugging tool is an implementation of Whalley's (1994) search technique for finding fault-inducing transformations (Section 5.5).
3.  Representing control-flow graphs

Hoopl is a library that makes it easy to define dataflow analyses—and transformations driven by these analyses—on control-flow graphs. Graphs are composed from smaller units, which we discuss from the bottom up:

• A node is defined by Hoopl's client; Hoopl knows nothing about the representation of nodes (Section 3.2).
• A basic block is a sequence of nodes (Section 3.3).
• A graph is an arbitrarily complicated control-flow graph: basic blocks connected by edges (Section 3.4).

3.1  Shapes: Open and closed

In Hoopl, nodes, blocks, and graphs share an important new property: a shape. A thing's shape tells us whether the thing is open or closed on entry and open or closed on exit. At an open point, control may implicitly "fall through;" at a closed point, control transfer must be explicit and to a named label. For example,
• A shift-left instruction is open on entry (because control can fall into it from the preceding instruction), and open on exit (because control falls through to the next instruction).
• An unconditional branch is open on entry, but closed on exit (because control cannot fall through to the next instruction).
• A label is closed on entry (because in Hoopl we do not allow control to fall through into a branch target), but open on exit.
• The shape of a function-call node is up to the client. If a call always returns to its inline successor, it could be open on entry and exit. But if a call could return in multiple ways—for example by returning normally or by raising an exception—then it has to be closed on exit. GHC uses calls of both shapes.
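To make the last bullet concrete, here is a sketch of ours (anticipating the type-level shape encoding of Section 3.2) of how a client might declare call nodes of both shapes; Fun and Arg are hypothetical client types.

data CallNode e x where
  -- A call that always returns to its inline successor: open/open.
  CallReturns  :: Fun -> [Arg] -> CallNode O O
  -- A call that either returns normally or raises an exception,
  -- each to a named label: open/closed.
  CallMayRaise :: Fun -> [Arg] -> Label -> Label -> CallNode O C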
data Node e x where
  Label  :: Label                  -> Node C O
  Assign :: Var  -> Expr           -> Node O O
  Store  :: Expr -> Expr           -> Node O O
  Branch :: Label                  -> Node O C
  Cond   :: Expr -> Label -> Label -> Node O C
  ... more constructors ...

Figure 1. A typical node type as it might be defined by a client

data O   -- Open
data C   -- Closed

data Block n e x where
  BFirst  :: n C O -> Block n C O
  BMiddle :: n O O -> Block n O O
  BLast   :: n O C -> Block n O C
  BCat    :: Block n e O -> Block n O x -> Block n e x

data Graph n e x where
  GNil  :: Graph n O O
  GUnit :: Block n O O -> Graph n O O
  GMany :: MaybeO e (Block n O C)
        -> LabelMap (Block n C C)
        -> MaybeO x (Block n C O)
        -> Graph n e x

data MaybeO ex t where
  JustO    :: t -> MaybeO O t
  NothingO ::      MaybeO C t

newtype Label      -- abstract
newtype LabelMap a -- finite map from Label to a

addBlock   :: NonLocal n
           => Block n C C
           -> LabelMap (Block n C C)
           -> LabelMap (Block n C C)
blockUnion :: LabelMap a -> LabelMap a -> LabelMap a

class NonLocal n where
  entryLabel :: n C x -> Label
  successors :: n e C -> [Label]

Figure 2. The block and graph types defined by Hoopl

Blocks and graphs have shapes too. For example the block

x:=7; y:=x+2; goto L

is open on entry and closed on exit, which we often abbreviate "open/closed." We may also refer to an "open/closed block." The shape of a thing determines that thing's control-flow properties. In particular, whenever E is a node, block, or graph,

• If E is open on entry, it has a unique predecessor; if it is closed, it may have arbitrarily many predecessors—or none.
• If E is open on exit, it has a unique successor; if it is closed, it may have arbitrarily many successors—or none.

3.2  Nodes

The primitive constituents of a control-flow graph are nodes. For example, in a back end a node might represent a machine instruction, such as a load, a call, or a conditional branch; in a higher-level intermediate form, a node might represent a simple statement. Hoopl's graph representation is polymorphic in the node type, so each client can define nodes as it likes. Because they contain nodes defined by the client, graphs can include arbitrary data specified by the client, including (say) method calls, C statements, stack maps, or whatever.
The type of a node specifies its shape at compile time. Concretely, the type constructor for a node has kind *->*->*, where the two type parameters are type-level flags, one for entry and one for exit. Each type parameter may be instantiated only with type O (for open) or type C (for closed).

As an example, Figure 1 shows a typical node type as it might be defined by one of Hoopl's clients. The type parameters are written e and x, for entry and exit respectively. The type is a generalized algebraic data type; the syntax gives the type of each constructor. For example, constructor Label takes a Label and returns a node of type Node C O, where the "C" says "closed on entry" and the "O" says "open on exit". The types Label, O, and C are defined by Hoopl (Figure 2). In other examples from Figure 1, constructor Assign takes a variable and an expression, and it returns a Node open on both entry and exit; constructor Store is similar. Finally, control-transfer nodes Branch and Cond (conditional branch) are open on entry and closed on exit. Types Var and Expr are private to the client, and Hoopl knows nothing about them.

Nodes closed on entry are the only targets of control transfers; nodes open on entry and exit never perform control transfers; and nodes closed on exit always perform control transfers.¹ Because of the position each shape of node occupies in a basic block, we often call them first, middle, and last nodes respectively.

¹ To obey these invariants, a node for a conditional-branch instruction, which typically either transfers control or falls through, must be represented as a two-target conditional branch, with the fall-through path in a separate block. This representation is standard (Appel 1998), and it costs nothing in practice: such code is easily sequentialized without superfluous branches.

3.3  Blocks

Hoopl combines the client's nodes into blocks and graphs, which, unlike the nodes, are defined by Hoopl (Figure 2). A Block is parameterized over the node type n as well as over the flag types that make it open or closed at entry and exit.

The BFirst, BMiddle, and BLast constructors create one-node blocks. Each of these constructors is polymorphic in the node's representation but monomorphic in its shape. Why not use a single constructor of type n e x -> Block n e x, which would be polymorphic in a node's representation and shape? Because by making the shape known statically, we simplify the implementation of analysis and transformation in Section 5.

The BCat constructor concatenates blocks in sequence. It makes sense to concatenate blocks only when control can fall through from the first to the second; therefore, two blocks may be concatenated only if each block is open at the point of concatenation. This restriction is enforced by the type of BCat, whose first argument must be open on exit and whose second argument must be open on entry. It is impossible, for example, to concatenate a Branch immediately before an Assign. Indeed, the Block type guarantees statically that any closed/closed Block—which compiler writers normally call a "basic block"—consists of exactly one first node (such as Label in Figure 1), followed by zero or more middle nodes (Assign or Store), and terminated with exactly one last node (Branch or Cond). Enforcing these invariants by using GADTs is one of Hoopl's innovations.
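To see the shape discipline at work, here is a small sketch of ours that assembles the basic block "entry: v:=e; goto target" from one-node blocks; Var and Expr are the client types of Figure 1, so the variable and expression are taken as parameters.

-- Build a closed/closed basic block: a label, one assignment, and an
-- unconditional branch. Each BCat joins an open exit to an open entry.
mkAssignBlock :: Label -> Var -> Expr -> Label -> Block Node C C
mkAssignBlock entry v e target =
  (BFirst (Label entry) `BCat` BMiddle (Assign v e))
    `BCat` BLast (Branch target)

Appending anything after the BLast, or starting the block with anything but a first node, is rejected by the type checker, which is exactly the invariant described above.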
3.4  Graphs

Part of optimizer                 Specified by   Implemented by   How many
Control-flow graphs               US             US               One
Nodes in a control-flow graph     YOU            YOU              One type per intermediate language
Dataflow fact F                   YOU            YOU              One type per logic
Lattice operations                US             YOU              One set per logic
Transfer functions                US             YOU              One per analysis
Rewrite functions                 US             YOU              One per transformation
Analyze-and-rewrite functions     US             US               Two (forward, backward)

Table 3. Parts of an optimizer built with Hoopl

Hoopl composes blocks into graphs, which are also defined in Figure 2. Like Block, the data type Graph is parameterized over both nodes n and over its shape at entry and exit (e and x). Graph has three constructors. The first two deal with the base cases of open/open graphs: an empty graph is represented by GNil while a single-block graph is represented by GUnit.

More general graphs are represented by GMany, which has three fields: an optional entry sequence, a body, and an optional exit sequence.

• If the graph is open on entry, it contains an entry sequence of
type Block n O C. We could represent this sequence as a value of type Maybe (Block n O C), but we can do better: a value of Maybe type requires a dynamic test, but we know statically, at compile time, that the sequence is present if and only if the graph is open on entry. We express our compile-time knowledge by using the type MaybeO e (Block n O C), a type-indexed version of Maybe which is also defined in Figure 2: the type MaybeO O a is isomorphic to a, while the type MaybeO C a is isomorphic to ().
• The body of the graph is a collection of closed/closed blocks. To facilitate traversal of the graph, we represent the body as a finite map from label to block.
• The exit sequence is dual to the entry sequence, and like the entry sequence, its presence or absence is deducible from the static type of the graph.

Graphs can be spliced together nicely; the cost is logarithmic in the number of closed/closed blocks. Unlike blocks, two graphs may be spliced together not only when they are both open at the splice point but also when they are both closed—and not in the other two cases:

gSplice :: Graph n e a -> Graph n a x -> Graph n e x
gSplice GNil g2 = g2
gSplice g1 GNil = g1

gSplice (GUnit b1) (GUnit b2) = GUnit (b1 `BCat` b2)

gSplice (GUnit b) (GMany (JustO e) bs x)
  = GMany (JustO (b `BCat` e)) bs x

gSplice (GMany e bs (JustO x)) (GUnit b2)
  = GMany e bs (JustO (x `BCat` b2))

gSplice (GMany e1 bs1 (JustO x1)) (GMany (JustO e2) bs2 x2)
  = GMany e1 (bs1 `blockUnion` (b `addBlock` bs2)) x2
  where b = x1 `BCat` e2

gSplice (GMany e1 bs1 NothingO) (GMany NothingO bs2 x2)
  = GMany e1 (bs1 `blockUnion` bs2) x2

This definition illustrates the power of GADTs: the pattern matching is exhaustive, and all the shape invariants are checked statically. For example, consider the second-to-last equation for gSplice. Since the exit sequence of the first argument is JustO x1, we know that type parameter a is O, and hence the entry sequence of the second argument must be JustO e2. Moreover, block x1 must be closed/open, and block e2 must be open/closed. We can therefore concatenate x1 and e2 with BCat to produce a closed/closed block b, which is added to the body of the result.

We have carefully crafted the types so that if BCat is considered as an associative operator, every graph has a unique representation. To guarantee uniqueness, GUnit is restricted to open/open blocks. If GUnit were more polymorphic, there would be more than one way to represent some graphs, and it wouldn't be obvious to a client which representation to choose—or if the choice made a difference.

3.5  Edges, labels and successors

Although Hoopl is polymorphic in the type of nodes, it still needs to know how control may be transferred from one node to another. Within a block, a control-flow edge is implicit in every application of the BCat constructor. An implicit edge originates in a first node or a middle node and flows to a middle node or a last node.

Between blocks, a control-flow edge is represented as chosen by the client. An explicit edge originates in a last node and flows to a (labelled) first node. If Hoopl is polymorphic in the node type, how can it follow such edges? Hoopl requires the client to make the node type an instance of Hoopl's NonLocal type class, which is defined in Figure 2. The entryLabel method takes a first node (one closed on entry, as per Section 3.2) and returns its Label; the successors method takes a last node (closed on exit) and returns the Labels to which it can transfer control.

In Figure 1, the client's instance declaration for Node would be

instance NonLocal Node where
  entryLabel (Label l)      = l
  successors (Branch b)     = [b]
  successors (Cond e b1 b2) = [b1, b2]

Again, the pattern matching for both functions is exhaustive, and the compiler checks this fact statically. Here, entryLabel cannot be applied to an Assign or Branch node, and any attempt to define a case for Assign or Branch would result in a type error.

While the client provides this information about nodes, it is convenient for Hoopl to get the same information about blocks. Internally, Hoopl uses this instance declaration for the Block type:

instance NonLocal n => NonLocal (Block n) where
  entryLabel (BFirst n) = entryLabel n
  entryLabel (BCat b _) = entryLabel b
  successors (BLast n)  = successors n
  successors (BCat _ b) = successors b

Because the functions entryLabel and successors are used to track control flow within a graph, Hoopl does not need to ask for the entry label or successors of a Graph itself. Indeed, Graph cannot be an instance of NonLocal, because even if a Graph is closed on entry, it need not have a unique entry label.
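Because entryLabel and successors are all that Hoopl needs, a client can also write node-agnostic utilities of its own. A small sketch (ours, not from the paper):

-- Collect the entry labels and all control-flow targets of a list of
-- basic blocks, for any node type with a NonLocal instance.
entryLabels :: NonLocal n => [Block n C C] -> [Label]
entryLabels = map entryLabel

targetsOf :: NonLocal n => [Block n C C] -> [Label]
targetsOf = concatMap successors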
4.  Using Hoopl to analyze and transform graphs

Now that we have graphs, how do we optimize them? Hoopl makes it easy; a client must supply these pieces:
• A node type (Section 3.2). Hoopl supplies the Block and Graph types that let the client build control-flow graphs out of nodes.
• A data type of facts and some operations over those facts (Section 4.1). Each analysis uses facts that are specific to that particular analysis, which Hoopl accommodates by being polymorphic in the fact type.
• A transfer function that takes a node and returns a fact transformer, which takes a fact flowing into the node and returns the transformed fact that flows out of the node (Section 4.2).
• A rewrite function that takes a node and an input fact, performs a monadic action, and returns either Nothing or Just g, where g is a graph that should replace the node (Sections 4.3 and 4.4). For many code-improving transformations, the ability to replace a node by a graph is crucial.

These requirements are summarized in Table 3. Because facts, transfer functions, and rewrite functions work together, we combine them in a single record of type FwdPass (Figure 4).

data FwdPass m n f
  = FwdPass { fp_lattice  :: DataflowLattice f
            , fp_transfer :: FwdTransfer n f
            , fp_rewrite  :: FwdRewrite m n f }

------- Lattice ----------
data DataflowLattice f = DataflowLattice
  { fact_bot  :: f
  , fact_join :: JoinFun f }
type JoinFun f = OldFact f -> NewFact f -> (ChangeFlag, f)
newtype OldFact f = OldFact f
newtype NewFact f = NewFact f
data ChangeFlag = NoChange | SomeChange

------- Transfers ----------
newtype FwdTransfer n f  -- abstract type
mkFTransfer :: (forall e x . n e x -> f -> Fact x f)
            -> FwdTransfer n f

------- Rewrites ----------
newtype FwdRewrite m n f  -- abstract type
mkFRewrite :: FuelMonad m
           => (forall e x . n e x -> f -> m (Maybe (Graph n e x)))
           -> FwdRewrite m n f
thenFwdRw  :: FwdRewrite m n f -> FwdRewrite m n f
           -> FwdRewrite m n f
iterFwdRw  :: FwdRewrite m n f -> FwdRewrite m n f
noFwdRw    :: Monad m => FwdRewrite m n f

------- Fact-like things, aka "fact(s)" -----
type family   Fact x f :: *
type instance Fact O f = f
type instance Fact C f = FactBase f

------- FactBase -------
type FactBase f = LabelMap f
  -- A finite mapping from Labels to facts f
mkFactBase :: DataflowLattice f -> [(Label, f)] -> FactBase f

------- Rolling back speculative rewrites ----
class Monad m => CkpointMonad m where
  type Checkpoint m
  checkpoint :: m (Checkpoint m)
  restart    :: Checkpoint m -> m ()

------- Optimization fuel ----
type Fuel = Int
class Monad m => FuelMonad m where
  getFuel :: m Fuel
  setFuel :: Fuel -> m ()

Figure 4. Hoopl API data types

Given a node type n and a FwdPass, a client can ask Hoopl to analyze and rewrite a graph. Hoopl provides a fully polymorphic interface, but for purposes of exposition, we present a function that is specialized to a closed/closed graph:

analyzeAndRewriteFwdBody
  :: ( CkpointMonad m  -- Roll back speculative actions
     , NonLocal n )    -- Extract non-local flow edges
  => FwdPass m n f     -- Lattice, transfer, rewrite
  -> [Label]           -- Entry point(s)
  -> Graph n C C       -- Input graph
  -> FactBase f        -- Input fact(s)
  -> m ( Graph n C C   -- Result graph
       , FactBase f )  -- ... and its facts

Given a FwdPass and a list of entry points, the analyze-and-rewrite function transforms a graph into an optimized graph. As its type shows, this function is polymorphic in the types of nodes n and facts f; these types are chosen by the client. The type of the monad m is also chosen by the client.

As well as taking and returning a graph, the function also takes input facts (the FactBase) and produces output facts. A FactBase is a finite mapping from Label to facts (Figure 4); if a Label is not in the domain of the FactBase, its fact is the bottom element of the lattice. For example, in our constant-propagation example from Section 2, if the graph represents the body of a procedure with parameters x, y, z, we would map the entry Label to a fact x=⊤ ∧ y=⊤ ∧ z=⊤, to specify that the procedure's parameters are not known to be constants.

The client's model of analyzeAndRewriteFwdBody is as follows: Hoopl walks forward over each block in the graph. At each node, Hoopl applies the rewrite function to the node and the incoming fact. If the rewrite function returns Nothing, the node is retained as part of the output graph, the transfer function is used to compute the outgoing fact, and Hoopl moves on to the next node. But if the rewrite function returns Just g, indicating that it wants to rewrite the node to the replacement graph g, Hoopl recursively analyzes and may further rewrite g before moving on to the next node. A node following a rewritten node sees up-to-date facts; that is, its input fact is computed by analyzing the replacement graph.

A rewrite function may take any action that is justified by the incoming fact. If further analysis invalidates the fact, Hoopl rolls back the action. Because graphs cannot be mutated, rolling back to the original graph is easy. But rolling back a rewrite function's monadic action requires cooperation from the client: the client must provide checkpoint and restart operations, which make m an instance of Hoopl's CkpointMonad class (Section 4.7).

Below we flesh out the interface to analyzeAndRewriteFwdBody, leaving the implementation for Section 5.
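As an illustration of the cooperation Hoopl asks for, here is a minimal sketch of ours (not from the paper): for a client whose monad is a pure state monad, a checkpoint can simply be the entire state.

import Control.Monad.State

instance CkpointMonad (State s) where
  type Checkpoint (State s) = s
  checkpoint = get   -- save the state at the point of speculation
  restart    = put   -- roll back every effect since the checkpoint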
4.1  Dataflow lattices

For each analysis or transformation, the client must define a type of dataflow facts. A dataflow fact often represents an assertion about a program point, but in general, dataflow analysis establishes properties of paths:

• An assertion about all paths to a program point is established by a forward analysis. For example the assertion "x = 3" at point P claims that variable x holds value 3 at P, regardless of the path by which P is reached.
• An assertion about all paths from a program point is established by a backward analysis. For example, the assertion "x is dead" at point P claims that no path from P uses variable x.

A set of dataflow facts must form a lattice, and Hoopl must know (a) the bottom element of the lattice and (b) how to take the least upper bound (join) of two elements. To ensure that analysis terminates, it is enough if every fact has a finite number of distinct facts above it, so that repeated joins eventually reach a fixed point.

In practice, joins are computed at labels. If f_old is the fact currently associated with a label L, and if a transfer function propagates a new fact f_new into label L, Hoopl replaces f_old with the join f_old ⊔ f_new. And Hoopl needs to know if f_old ⊔ f_new = f_old, because if not, the analysis has not reached a fixed point. The bottom element and join operation of a lattice of facts of type f are stored in a value of type DataflowLattice f (Figure 4). As noted in the previous paragraph, Hoopl needs to know when the result of a join is equal to the old fact. It is often easiest to answer this question while the join itself is being computed. By contrast, a post facto equality test on facts might cost almost as much as a join. For these reasons, Hoopl does not require a separate equality test on facts. Instead, Hoopl requires that fact_join return a ChangeFlag as well as the join. If the join is the same as the old fact, the ChangeFlag should be NoChange; if not, the ChangeFlag should be SomeChange.

To help clients create lattices and join functions, Hoopl includes functions and constructors that can extend a fact type f with top and bottom elements. In this paper, we use only type WithTop, which comes with value constructors that have these types:

PElem :: f -> WithTop f
Top   :: WithTop f

Hoopl provides combinators which make it easy to create join functions that use Top. The most useful is extendJoinDomain, which uses auxiliary types defined in Figure 4:

extendJoinDomain
  :: (OldFact f -> NewFact f -> (ChangeFlag, WithTop f))
  -> JoinFun (WithTop f)

A client supplies a join function that consumes only facts of type f, but may produce either Top or a fact of type f—as in the example of Figure 5 below. Calling extendJoinDomain extends the client's function to a proper join function on the type WithTop a, guaranteeing that joins involving Top obey the appropriate algebraic laws.

Hoopl also provides a value constructor Bot and type constructors WithBot and WithTopAndBot, along with similar functions. Constructors Top and Bot are polymorphic, so for example, Top also has type WithTopAndBot a.

It is also common to use a lattice that takes the form of a finite map. In such lattices it is typical to join maps pointwise, and Hoopl provides a function that makes it convenient to do so:

joinMaps :: Ord k => JoinFun f -> JoinFun (Map.Map k f)
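Figure 5 is not part of this excerpt, but a lattice for constant propagation along the lines just described might look as follows; this is a sketch assuming client types Var and Lit (with Ord and Eq instances), not code from the paper.

import qualified Data.Map as Map

-- A fact maps each variable either to a known constant (PElem) or to
-- Top, meaning "more than one value reaches this point."
type ConstFact = Map.Map Var (WithTop Lit)

constLattice :: DataflowLattice ConstFact
constLattice = DataflowLattice
  { fact_bot  = Map.empty
  , fact_join = joinMaps (extendJoinDomain constFactAdd) }
  where
    -- Joining two equal constants changes nothing; two different
    -- constants join to Top.
    constFactAdd (OldFact old) (NewFact new)
      | new == old = (NoChange,   PElem new)
      | otherwise  = (SomeChange, Top)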
4.2  The transfer function

A forward transfer function is presented with the dataflow fact coming into a node, and it computes dataflow fact(s) on the node's outgoing edge(s). In a forward analysis, Hoopl starts with the fact at the beginning of a block and applies the transfer function to successive nodes in that block, until eventually the transfer function for the last node computes the facts that are propagated to the block's successors. For example, consider doing constant propagation (Section 2) on the following graph, whose entry point is L1:

L1: x=3; goto L2
L2: y=x+4; x=x-1; if x>0 then goto L2 else return

Forward analysis starts with the bottom fact {} at every label except the entry L1. The initial fact at L1 is {x=⊤, y=⊤}. Analyzing L1 propagates this fact forward, applying the transfer function successively to the nodes of L1, and propagating the new fact {x=3, y=⊤} to L2. This new fact is joined with the existing (bottom) fact at L2. Now the analysis propagates L2's fact forward, again applying the transfer function, and propagating the new fact {x=2, y=7} to L2. Again the new fact is joined with the existing fact at L2, and the process repeats until the facts reach a fixed point.

A transfer function has an unusual sort of type: not quite a dependent type, but not a bog-standard polymorphic type either. The result type of the transfer function is indexed by the shape (i.e., the type) of the node argument: If the node is open on exit, the transfer function produces a single fact. But if the node is closed on exit, the transfer function produces a collection of (Label, fact) pairs: one for each outgoing edge. The collection is represented by a FactBase; auxiliary function mkFactBase (Figure 4) joins facts on distinct outgoing edges that target the same label. The indexing is expressed by Haskell's (recently added) indexed type families. A forward transfer function supplied by a client, which is passed to mkFTransfer, is polymorphic in e and x (Figure 4). It takes a node of type n e x, and it returns a fact transformer of type f -> Fact x f. Type constructor Fact is a species of type-level function: its signature is given in the type family declaration, and its definition is given by two type instance declarations. The first declaration says that a Fact O f, which comes out of a node open on exit, is just a fact f. The second declaration says that a Fact C f, which comes out of a node closed on exit, is a mapping from Label to facts.
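Continuing the sketch above (our illustration, assuming the Node type of Figure 1 and a hypothetical Lit constructor in the client's Expr type), a transfer function for constant propagation shows the shape-indexed result type in action:

constTransfer :: FwdTransfer Node ConstFact
constTransfer = mkFTransfer tf
  where
    tf :: Node e x -> ConstFact -> Fact x ConstFact
    tf (Label _)          f = f                         -- open on exit
    tf (Assign v (Lit k)) f = Map.insert v (PElem k) f  -- known constant
    tf (Assign v _)       f = Map.insert v Top f        -- unknown value
    tf (Store _ _)        f = f
    tf (Branch l)         f = mkFactBase constLattice [(l, f)]
    tf (Cond _ tl fl)     f = mkFactBase constLattice [(tl, f), (fl, f)]
    -- (remaining Node constructors elided, as in Figure 1)

Nodes open on exit return a plain fact; nodes closed on exit return a FactBase, exactly as dictated by the Fact type family.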
we might want to remove a node by returning the empty graph, or more ambitiously, we might want to replace a high-level operation with a tree of conditional branches or a loop, which would entail returning a graph containing new blocks with internal control flow.
Appealing to this model, we see that • A function mkFRewrite rw never rewrites a replacement graph;
this behavior is shallow rewriting. • When a function r1 ‘thenFwdRw‘ r2 is applied to a node,
It must also be possible for a rewrite function to decide to do nothing. The result of the monadic computation returned by r may therefore be Nothing, indicating that the node should not be rewritten, or Just g, indicating that the node should be replaced with g: the replacement graph.
The type of mkFRewrite in Figure 4 guarantees that the replacement graph g has the same shape as the node being rewritten. For example, a branch instruction can be replaced only by a graph closed on exit.

4.4 Shallow rewriting, deep rewriting, rewriting combinators, and the meaning of FwdRewrite

When a node is rewritten, the replacement graph g must itself be analyzed, and its nodes may be further rewritten. Hoopl can make a recursive call to analyzeAndRewriteFwdBody—but how should it rewrite the replacement graph g? There are two common cases:

• Rewrite g using the same rewrite function that produced g. This procedure is called deep rewriting. When deep rewriting is used, the client’s rewrite function must ensure that the graphs it produces are not rewritten indefinitely (Section 4.8).

• Analyze g without rewriting it. This procedure is called shallow rewriting.

Deep rewriting is essential to achieve the full benefits of interleaved analysis and transformation (Lerner, Grove, and Chambers 2002). But shallow rewriting can be vital as well; for example, a backward dataflow pass that inserts a spill before a call must not rewrite the call again, lest it attempt to insert infinitely many spills.

An innovation of Hoopl is to build the choice of shallow or deep rewriting into each rewrite function, through the use of the four combinators mkFRewrite, thenFwdRw, iterFwdRw, and noFwdRw shown in Figure 4. Every rewrite function is made with these combinators, and its behavior is characterized by the answers to two questions: Does the function rewrite a node to a replacement graph? If so, what rewrite function should be used to analyze the replacement graph recursively? To answer these questions, we present an algebraic datatype that models FwdRewrite with one constructor for each combinator:

  data Rw r = Mk r | Then (Rw r) (Rw r) | Iter (Rw r) | No

Using this model, we specify how a rewrite function works by giving a reference implementation: the function rewrite, below, computes the replacement graph and rewrite function that result from applying a rewrite function r to a node and a fact f. The code is in continuation-passing style; when the node is rewritten, the first continuation j accepts a pair containing the replacement graph and the new rewrite function to be used to transform it. When the node is not rewritten, the second continuation n is the (lazily evaluated) result.

  rewrite :: Monad m => FwdRewrite m n f -> n e x -> f
          -> m (Maybe (Graph n e x, FwdRewrite m n f))
  rewrite r node f = rew r (return . Just) (return Nothing)
    where
      rew (Mk rw) j n = do { mg <- rw node f
                           ; case mg of Nothing -> n
                                        Just g  -> j (g, No) }
      rew (r1 ‘Then‘ r2) j n = rew r1 (j . add r2) (rew r2 j n)
      rew (Iter r)       j n = rew r (j . add (Iter r)) n
      rew No             j n = n
      add nextrw (g, r) = (g, r ‘Then‘ nextrw)

Appealing to this model, we see that

• A function mkFRewrite rw never rewrites a replacement graph; this behavior is shallow rewriting.

• When a function r1 ‘thenFwdRw‘ r2 is applied to a node, if r1 replaces the node, then r2 is used to transform the replacement graph. And if r1 does not replace the node, Hoopl tries r2.

• When a function iterFwdRw r rewrites a node, iterFwdRw r is used to transform the replacement graph; this behavior is deep rewriting. If r does not rewrite a node, neither does iterFwdRw r.

• Finally, noFwdRw never replaces a graph.

For convenience, we also provide the function deepFwdRw, which is the composition of iterFwdRw and mkFRewrite.

Our combinators satisfy the algebraic laws that you would expect; for example, noFwdRw is a left and right identity of thenFwdRw. A more interesting law is

  iterFwdRw r = r ‘thenFwdRw‘ iterFwdRw r

Unfortunately, this law cannot be used to define iterFwdRw: if we used this law to define iterFwdRw, then when r returned Nothing, iterFwdRw r would diverge.

4.5 When the type of nodes is not known

We note above (Section 4.2) that the type of a transfer function’s result depends on the argument’s shape on exit. It is easy for a client to write a type-indexed transfer function, because the client defines the constructor and shape for each node. The client’s transfer functions discriminate on the constructor and so can return a result that is indexed by each node’s shape.

What if you want to write a transfer function that does not know the type of the node? For example, a dominator analysis need not scrutinize nodes; it needs to know only about labels and edges in the graph. Ideally, a dominator analysis would work with any type of node n, provided only that n is an instance of the NonLocal type class. But if we don’t know the type of n, we can’t write a function of type n e x -> f -> Fact x f, because the only way to get the result type right is to scrutinize the constructors of n.

There is another way; in place of a single function that is polymorphic in shape, Hoopl also accepts a triple of functions, each of which is polymorphic in the node’s type but monomorphic in its shape:

  mkFTransfer3 :: (n C O -> f -> Fact O f)
               -> (n O O -> f -> Fact O f)
               -> (n O C -> f -> Fact C f)
               -> FwdTransfer n f

We have used this interface to write a number of functions that are polymorphic in the node type n:

• A function that takes a FwdTransfer and wraps it in logging code, so an analysis can be debugged by watching facts flow through nodes

• A pairing function that runs two passes interleaved, not sequentially, potentially producing better results than any sequence:

    pairFwd :: Monad m => FwdPass m n f -> FwdPass m n f’ -> FwdPass m n (f, f’)

• An efficient dominator analysis in the style of Cooper, Harvey, and Kennedy (2001), whose transfer function is implemented using only the functions in the NonLocal type class (a sketch follows this list)
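To make the shape-monomorphic interface concrete, here is a hedged sketch of a transfer function that never scrutinizes node constructors, in the spirit of the dominator analysis just mentioned. Doms and extend are invented stand-ins for a dominator fact and the operation that extends it with a newly entered label; entryLabel, successors, and mkFactBase are used as described elsewhere in this section:

  domTransfer :: NonLocal n => DataflowLattice Doms -> FwdTransfer n Doms
  domTransfer lat = mkFTransfer3 first middle last
    where
      first  n f = extend f (entryLabel n)  -- entering a block: only its label matters
      middle _ f = f                        -- interior nodes cannot affect dominators
      last   n f = mkFactBase lat [ (l, f) | l <- successors n ]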
4.6 Example: Constant propagation and constant folding

Figure 5 shows client code for constant propagation and constant folding. For each variable, at each program point, the analysis concludes one of three facts: the variable holds a constant value of type Lit, the variable might hold a non-constant value, or what the variable holds is unknown. We represent these facts using a finite map from a variable to a fact of type WithTop Lit (Section 4.1). A variable with a constant value maps to Just (PElem k), where k is the constant value; a variable with a non-constant value maps to Just Top; and a variable with an unknown value maps to Nothing (it is not in the domain of the finite map).

  -- Type and definition of the lattice
  type ConstFact = Map.Map Var (WithTop Lit)
  constLattice :: DataflowLattice ConstFact
  constLattice = DataflowLattice
    { fact_bot  = Map.empty
    , fact_join = joinMaps (extendJoinDomain constFactAdd) }
    where
      constFactAdd _ (OldFact old) (NewFact new)
        = if new == old then (NoChange,   PElem new)
                        else (SomeChange, Top)

The definition of the lattice (constLattice) is straightforward. The bottom element is an empty map (nothing is known about what any variable holds). The join function is implemented with the help of combinators provided by Hoopl. The client writes a simple function, constFactAdd, which compares two values of type Lit and returns a result of type WithTop Lit. The client uses extendJoinDomain to lift constFactAdd into a join function on WithTop Lit, then uses joinMaps to lift that join function up to the map containing facts for all variables.

  -- Analysis: variable equals a literal constant
  varHasLit :: FwdTransfer Node ConstFact
  varHasLit = mkFTransfer ft
    where
      ft :: Node e x -> ConstFact -> Fact x ConstFact
      ft (Label _)            f = f
      ft (Assign x (Lit k))   f = Map.insert x (PElem k) f
      ft (Assign x _)         f = Map.insert x Top f
      ft (Store _ _)          f = f
      ft (Branch l)           f = mapSingleton l f
      ft (Cond (Var x) tl fl) f =
        mkFactBase constLattice
          [(tl, Map.insert x (PElem (Bool True))  f),
           (fl, Map.insert x (PElem (Bool False)) f)]
      ft (Cond _ tl fl)       f =
        mkFactBase constLattice [(tl, f), (fl, f)]

The forward transfer function varHasLit is defined using the shape-polymorphic auxiliary function ft. For most nodes n, ft n simply propagates the input fact forward. But for an assignment node, if a variable x gets a constant value k, ft extends the input fact by mapping x to PElem k. And if a variable x is assigned a non-constant value, ft extends the input fact by mapping x to Top. There is one other interesting case: a conditional branch where the condition is a variable. If the conditional branch flows to the true successor, the variable holds True, and similarly for the false successor, mutatis mutandis. Function ft updates the fact flowing to each successor accordingly. Because ft scrutinizes a GADT, it cannot use a wildcard to default the uninteresting cases.

The transfer function need not consider complicated cases such as an assignment x:=y where y holds a constant value k. Instead, we rely on the interleaving of transformation and analysis to first transform the assignment to x:=k, which is exactly what our simple transfer function expects. As we mention in Section 2, interleaving makes it possible to write very simple transfer functions without missing opportunities to improve the code.

  -- Rewriting: replace constant variables
  constProp :: FuelMonad m => FwdRewrite m Node ConstFact
  constProp = mkFRewrite cp
    where
      cp node f  = return $ liftM nodeToG $ mapVN (lookup f) node
      mapVN      = mapEN . mapEE . mapVE
      lookup f x = case Map.lookup x f of
                     Just (PElem v) -> Just $ Lit v
                     _              -> Nothing

Figure 5’s rewrite function for constant propagation, constProp, rewrites each use of a variable to its constant value. The client has defined auxiliary functions that may change expressions or nodes:

  type MaybeChange a = a -> Maybe a
  mapVE :: (Var -> Maybe Expr) -> MaybeChange Expr
  mapEE :: MaybeChange Expr    -> MaybeChange Expr
  mapEN :: MaybeChange Expr    -> MaybeChange (Node e x)
  mapVN :: (Var -> Maybe Expr) -> MaybeChange (Node e x)
  nodeToG :: Node e x -> Graph Node e x

The client composes mapXX functions to apply lookup to each use of a variable in each kind of node; lookup substitutes for each variable that has a constant value. Applying liftM nodeToG lifts the final node, if present, into a Graph.

  -- Simplification ("constant folding")
  simplify :: FuelMonad m => FwdRewrite m Node f
  simplify = deepFwdRw simp
    where
      simp node _ = return $ liftM nodeToG $ s_node node
      s_node :: Node e x -> Maybe (Node e x)
      s_node (Cond (Lit (Bool b)) t f) = Just $ Branch (if b then t else f)
      s_node n = (mapEN . mapEE) s_exp n
      s_exp (Binop Add (Lit (Int n1)) (Lit (Int n2)))
              = Just $ Lit $ Int $ n1 + n2
      -- ... more cases for constant folding

Figure 5 also gives another, distinct function for constant folding: simplify. This function rewrites constant expressions to their values, and it rewrites a conditional branch on a boolean constant to an unconditional branch. To rewrite constant expressions, it runs s_exp on every subexpression. Function simplify does not check whether a variable holds a constant value; it relies on constProp to have replaced the variable by the constant. Indeed, simplify does not consult the incoming fact, so it is polymorphic in f.

  -- Defining the forward dataflow pass
  constPropPass = FwdPass
    { fp_lattice  = constLattice
    , fp_transfer = varHasLit
    , fp_rewrite  = constProp ‘thenFwdRw‘ simplify }

Figure 5. The client for constant propagation and constant folding (extracted automatically from code distributed with Hoopl)

The FwdRewrite functions constProp and simplify are useful independently. In this case, however, we want both of them, so we compose them with thenFwdRw. The composition, along with the lattice and the transfer function, goes into constPropPass (bottom of Figure 5). Given constPropPass, we can improve a graph g by passing constPropPass and g to analyzeAndRewriteFwdBody.
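As a usage sketch, the call might look roughly as follows; the precise argument list of analyzeAndRewriteFwdBody is given in Figure 4, which is not reproduced here, so the arguments below (a pass, entry labels, the graph, and an initial FactBase) are an assumption based on the descriptions above:

  -- Run the pass on graph g, starting with no information at the entry.
  optimise entry g =
    analyzeAndRewriteFwdBody constPropPass [entry] g
                             (mapSingleton entry Map.empty)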
4.7 Checkpointing the client’s monad

When analyzing a program with loops, a rewrite function could make a change that later has to be rolled back. For example, consider constant propagation in this loop, which computes factorial:

  i = 1; prod = 1;
  L1: if (i >= n) goto L3 else goto L2;
  L2: i = i + 1; prod = prod * i;
      goto L1;
  L3: ...

Function analyzeAndRewriteFwdBody iterates through this graph until the dataflow facts stop changing. On the first iteration, the assignment i = i + 1 is analyzed with an incoming fact i=1, and the assignment is rewritten to the graph i = 2. But on a later iteration, the incoming fact increases to i=⊤, and the rewrite is no longer justified. After each iteration, Hoopl starts the next iteration with new facts but with the original graph—by virtue of using purely functional data structures, rewrites from previous iterations are automatically rolled back.

But a rewrite function doesn’t only produce new graphs; it can also take monadic actions, such as acquiring a fresh name. These actions must also be rolled back, and because the client chooses the monad in which the actions take place, the client must provide the means to roll back the actions. Hoopl therefore defines a rollback interface, which each client must implement; it is the type class CkpointMonad from Figure 4:

  class Monad m => CkpointMonad m where
    type Checkpoint m
    checkpoint :: m (Checkpoint m)
    restart    :: Checkpoint m -> m ()

Hoopl calls the checkpoint method at the beginning of an iteration, then calls the restart method if another iteration is necessary. These operations must obey the following algebraic law:

  do { s <- checkpoint; m; restart s } == return ()

where m represents any combination of monadic actions that might be taken by rewrite functions. (The safest course is to make sure the law holds for any action in the monad.) The type of the saved checkpoint s is up to the client; it is specified as an associated type of the CkpointMonad class.
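For instance, a client whose monad is a simple state monad carrying a fresh-name supply can satisfy the law by saving and restoring that supply. This is an invented example (the monad SimpleM is not part of Hoopl, and the Functor/Applicative boilerplate that modern GHC requires is omitted):

  newtype SimpleM a = SimpleM { runSimpleM :: Int -> (a, Int) }  -- state: next fresh name

  instance Monad SimpleM where
    return a = SimpleM (\s -> (a, s))
    m >>= k  = SimpleM (\s -> let (a, s') = runSimpleM m s in runSimpleM (k a) s')

  instance CkpointMonad SimpleM where
    type Checkpoint SimpleM = Int
    checkpoint = SimpleM (\s -> (s, s))      -- save the name supply
    restart s  = SimpleM (\_ -> ((), s))     -- roll it back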
4.8 Correctness

Facts computed by the transfer function depend on graphs produced by the rewrite function, which in turn depend on facts computed by the transfer function. How do we know this algorithm is sound, or if it terminates? A proof requires a POPL paper (Lerner, Grove, and Chambers 2002); here we merely state the conditions for correctness as applied to Hoopl:

• The lattice must have no infinite ascending chains; that is, every sequence of calls to fact_join must eventually return NoChange.

• The transfer function must be monotonic: given a more informative fact in, it must produce a more informative fact out.

• The rewrite function must be sound: if it replaces a node n by a replacement graph g, then g must be observationally equivalent to n under the assumptions expressed by the incoming dataflow fact f. Moreover, analysis of g must produce output fact(s) that are at least as informative as the fact(s) produced by applying the transfer function to n. For example, if the transfer function says that x=7 after the node n, then after analysis of g, x had better still be 7.

• A transformation that uses deep rewriting must not return a replacement graph which contains a node that could be rewritten indefinitely.

Under these conditions, the algorithm terminates and is sound.

5. Hoopl’s implementation

Section 4 gives a client’s-eye view of Hoopl, showing how to create analyses and transformations. Hoopl’s interface is simple, but the implementation of interleaved analysis and rewriting is not. Lerner, Grove, and Chambers (2002) do not describe their implementation. We have written at least three previous implementations, all of which were long and hard to understand, and only one of which provided compile-time guarantees about open and closed shapes. We are not confident that any of these implementations are correct.

In this paper we describe a new implementation. It is elegant and short (about a third of the size of our last attempt), and it offers strong compile-time guarantees about shapes. We describe only the implementation of forward analysis and transformation. The implementations of backward analysis and transformation are exactly analogous and are included in Hoopl.

We also explain, in Section 5.5, how we isolate errors in faulty optimizers, and how the fault-isolation machinery is integrated with the rest of the implementation.

5.1 Overview

Instead of the interface function analyzeAndRewriteFwdBody, we present the more polymorphic, private function arfGraph, which is short for “analyze and rewrite forward graph”:

  arfGraph :: forall m n f e x. (CkpointMonad m, NonLocal n)
           => FwdPass m n f     -- lattice, transfers, rewrites
           -> MaybeC e [Label]  -- entry points for a closed graph
           -> Graph n e x       -- the original graph
           -> Fact e f          -- fact(s) flowing into entry/entries
           -> m (DG f n e x, Fact x f)

Function arfGraph has a more general type than the function analyzeAndRewriteFwdBody because arfGraph is used recursively to analyze graphs of all shapes. If a graph is closed on entry, a list of entry points must be provided; if the graph is open on entry, the graph’s entry sequence must be the only entry point. The graph’s shape on entry also determines the type of fact or facts flowing in. Finally, the result is a “decorated graph” DG f n e x, and if the graph is open on exit, an “exit fact” flowing out.

A “decorated graph” is one in which each block is decorated with the fact that holds at the start of the block. DG actually shares a representation with Graph, which is possible because the definition of Graph in Figure 2 contains a white lie: Graph is a type synonym for an underlying type Graph’, which takes the type of block as an additional parameter. (Similarly, function gSplice in Section 3.4 is actually a higher-order function that takes a block-concatenation function as a parameter.) The truth about Graph and DG is as follows:

  type Graph = Graph’ Block
  type DG f  = Graph’ (DBlock f)
  data DBlock f n e x = DBlock f (Block n e x)

Type DG is internal to Hoopl; it is not seen by any client. To convert a DG to the Graph and FactBase that are returned by the API function analyzeAndRewriteFwdBody, we use a 12-line function:

  normalizeGraph :: NonLocal n => DG f n e x -> (Graph n e x, FactBase f)
Function arfGraph is implemented as follows:

  arfGraph pass entries = graph
    where
      node  :: forall e x . (ShapeLifter e x)
            => n e x -> f -> m (DG f n e x, Fact x f)
      block :: forall e x . Block n e x -> f -> m (DG f n e x, Fact x f)
      body  :: [Label] -> LabelMap (Block n C C)
            -> Fact C f -> m (DG f n C C, Fact C f)
      graph :: Graph n e x -> Fact e f -> m (DG f n e x, Fact x f)
      ... definitions of ’node’, ’block’, ’body’, and ’graph’ ...

The four auxiliary functions help us separate concerns: for example, only node knows about rewrite functions, and only body knows about fixed points. Each auxiliary function works the same way: it takes a “thing” and returns an extended fact transformer. An extended fact transformer takes dataflow fact(s) coming into the “thing,” and it returns an output fact. It also returns a decorated graph representing the (possibly rewritten) “thing”—that’s the extended part. Finally, because rewrites are monadic, every extended fact transformer is monadic. The types of the extended fact transformers are not quite identical:

• Extended fact transformers for nodes and blocks have the same type; like forward transfer functions, they expect a fact f rather than the more general Fact e f required for a graph. Because a node or a block has exactly one fact flowing into the entry, it is easiest simply to pass that fact.

• Extended fact transformers for bodies have the same type as extended fact transformers for closed/closed graphs.

• Extended fact transformers for graphs have the most general type, as expressed using Fact: if the graph is open on entry, its fact transformer expects a single fact; if the graph is closed on entry, its fact transformer expects a FactBase.

Function arfGraph and its four auxiliary functions comprise a cycle of mutual recursion: arfGraph calls graph; graph calls body and block; body calls block; block calls node; and node calls arfGraph. These five functions do three different kinds of work: compose extended fact transformers, analyze and rewrite nodes, and compute fixed points.

5.2 Analyzing blocks and graphs by composing extended fact transformers

Extended fact transformers compose nicely. For example, block is implemented thus:

  block :: forall e x . Block n e x -> f -> m (DG f n e x, Fact x f)
  block (BFirst  n)  = node n
  block (BMiddle n)  = node n
  block (BLast   n)  = node n
  block (BCat b1 b2) = block b1 ‘cat‘ block b2

The composition function cat feeds facts from one extended fact transformer to another, and it splices decorated graphs:

  cat :: forall e a x f1 f2 f3.
         (f1 -> m (DG f n e a, f2))
      -> (f2 -> m (DG f n a x, f3))
      -> (f1 -> m (DG f n e x, f3))
  cat ft1 ft2 f = do { (g1, f1) <- ft1 f
                     ; (g2, f2) <- ft2 f1
                     ; return (g1 ‘dgSplice‘ g2, f2) }

(Function dgSplice is the same splicing function used for an ordinary Graph, but it uses a one-line block-concatenation function suitable for DBlocks.) The name cat comes from the concatenation of the decorated graphs, but it is also appropriate because the style in which it is used is reminiscent of concatMap, with the node and block functions playing the role of map.

Function graph is much like block, but it has more cases.

5.3 Analyzing and rewriting nodes

The node function is where we interleave analysis with rewriting:

  node :: forall e x . (ShapeLifter e x)
       => n e x -> f -> m (DG f n e x, Fact x f)
  node n f = do { grw <- frewrite pass n f
                ; case grw of
                    Nothing      -> return ( singletonDG f n
                                           , ftransfer pass n f )
                    Just (g, rw) ->
                      let pass’ = pass { fp_rewrite = rw }
                          f’    = fwdEntryFact n f
                      in  arfGraph pass’ (fwdEntryLabel n) g f’ }

Function node uses frewrite to extract the rewrite function from pass, and it applies that rewrite function to node n and incoming fact f. The result, grw, is scrutinized by the case expression.

In the Nothing case, no rewrite takes place. We return node n and its incoming fact f as the decorated graph singletonDG f n. To produce the outgoing fact, we apply the transfer function ftransfer pass to n and f.

In the Just case, we receive a replacement graph g and a new rewrite function rw, as specified by the model in Section 4.4. We use rw to analyze and rewrite g recursively with arfGraph. The recursive analysis uses a new pass pass’, which contains the original lattice and transfer function from pass, together with rw. Function fwdEntryFact converts fact f from the type f, which node has, to the type Fact e f, which arfGraph expects.

As shown above, several functions called in node are overloaded over a (private) class ShapeLifter. Their implementations depend on the open/closed shape of the node. By design, the shape of a node is known statically everywhere node is called, so this use of ShapeLifter is specialized away by the compiler.

  class ShapeLifter e x where
    singletonDG   :: f -> n e x -> DG f n e x
    fwdEntryFact  :: NonLocal n => n e x -> f -> Fact e f
    fwdEntryLabel :: NonLocal n => n e x -> MaybeC e [Label]
    ftransfer     :: FwdPass m n f -> n e x -> f -> Fact x f
    frewrite      :: FwdPass m n f -> n e x -> f
                  -> m (Maybe (Graph n e x, FwdRewrite m n f))

5.4 Fixed points

The fourth and final auxiliary function of arfGraph is body, which iterates to a fixed point. This part of the implementation is the only really tricky part, and it is cleanly separated from everything else:

  body :: [Label] -> LabelMap (Block n C C)
       -> Fact C f -> m (DG f n C C, Fact C f)
  body entries blockmap init_fbase
    = fixpoint Fwd lattice do_block blocks init_fbase
    where
      blocks  = forwardBlockList entries blockmap
      lattice = fp_lattice pass
      do_block b fb = block b entryFact
        where entryFact = getFact lattice (entryLabel b) fb

Function getFact looks up a fact by its label. If the label is not found, getFact returns the bottom element of the lattice:

  getFact :: DataflowLattice f -> Label -> FactBase f -> f
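A plausible one-line implementation, assuming a lookup function on FactBase (the name lookupFact is an assumption here, not quoted from Hoopl):

  getFact lat l fb = case lookupFact l fb of
                       Just f  -> f
                       Nothing -> fact_bot lat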
Function forwardBlockList takes a list of possible entry points and a finite map from labels to blocks. It returns a list of blocks, sorted into an order that makes forward dataflow efficient.²

  forwardBlockList :: NonLocal n
                   => [Label] -> LabelMap (Block n C C) -> [Block n C C]

For example, if the entry point is at L2, and the block at L2 branches to L1, but not vice versa, then Hoopl will reach a fixed point more quickly if we process L2 before L1. To find an efficient order, forwardBlockList uses the methods of the NonLocal class—entryLabel and successors—to perform a reverse postorder depth-first traversal of the control-flow graph.

² The order of the blocks does not affect the fixed point or any other result; it affects only the number of iterations needed to reach the fixed point.

The rest of the work is done by fixpoint, which is shared by both forward and backward analyses:

  data Direction = Fwd | Bwd
  fixpoint :: forall m n f. (CkpointMonad m, NonLocal n)
           => Direction
           -> DataflowLattice f
           -> (Block n C C -> Fact C f -> m (DG f n C C, Fact C f))
           -> [Block n C C]
           -> (Fact C f -> m (DG f n C C, Fact C f))
Except for the Direction passed as the first argument, the type signature tells the story. The third argument can produce an extended fact transformer for any single block; fixpoint applies it successively to each block in the list passed as the fourth argument. Function fixpoint returns an extended fact transformer for the list.

The extended fact transformer returned by fixpoint maintains a “current FactBase” which grows monotonically: as each block is analyzed, the block’s input fact is taken from the current FactBase, and the current FactBase is augmented with the facts that flow out of the block. The initial value of the current FactBase is the input FactBase, and the extended fact transformer iterates over the blocks until the current FactBase stops changing.

Implementing fixpoint requires about 90 lines, formatted for narrow display. The code, which is appended to the Web version of this paper (http://bit.ly/cZ7ts1), is mostly straightforward—although we try to be clever about deciding when a new fact means that another iteration is required. There is one more subtle point worth mentioning, which we highlight by considering a forward analysis of this graph, where execution starts at L1:

  L1: x:=3; goto L4
  L2: x:=4; goto L4
  L4: if x>3 goto L2 else goto L5

Block L2 is unreachable. But if we naïvely process all the blocks (say in order L1, L4, L2), then we will start with the bottom fact for L2, propagate {x=4} to L4, where it will join with {x=3} to yield {x=⊤}. Given x=⊤, the conditional in L4 cannot be rewritten, and L2 seems reachable. We have lost a good optimization.

Function fixpoint solves this problem by analyzing a block only if the block is reachable from an entry point. This trick is safe only for a forward analysis, which is why fixpoint takes a Direction as its first argument.

5.5 Throttling rewriting using “optimization fuel”

When optimization produces a faulty program, we use Whalley’s (1994) technique to find the fault: given a program that fails when compiled with optimization, a binary search on the number of rewrites finds an n such that the program works after n − 1 rewrites but fails after n rewrites. The nth rewrite is faulty. As alluded to at the end of Section 2, this technique enables us to debug complex optimizations by identifying one single rewrite that is faulty.

To use this debugging technique, we must be able to control the number of rewrites. We limit rewrites using optimization fuel. Each rewrite consumes one unit of fuel, and when fuel is exhausted, all rewrite functions return Nothing. To debug, we do binary search on the amount of fuel.

The supply of fuel is encapsulated in the FuelMonad type class (Figure 4), which must be implemented by the client’s monad m. To ensure that each rewrite consumes one unit of fuel, mkFRewrite wraps the client’s rewrite function, which must be oblivious to fuel, in another function that satisfies the following contract (a sketch follows this list):

• If the fuel supply is empty, the wrapped function always returns Nothing.

• If the wrapped function returns Just g, it has the monadic effect of reducing the fuel supply by one unit.
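A hedged sketch of how such a wrapper might look, assuming FuelMonad provides getFuel and setFuel (names mentioned in Section 8 below) and that fuel is counted as an Int; the function withFuel itself is invented for illustration:

  withFuel :: FuelMonad m => m (Maybe a) -> m (Maybe a)
  withFuel rw = do
    fuel <- getFuel
    if fuel == 0
      then return Nothing              -- out of fuel: rewrite nothing
      else do mg <- rw
              case mg of
                Nothing -> return Nothing
                Just g  -> do setFuel (fuel - 1)  -- a successful rewrite costs one unit
                              return (Just g)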
6. Related work

While there is a vast body of literature on dataflow analysis and optimization, relatively little can be found on the design of optimizers, which is the topic of this paper. We therefore focus on the foundations of dataflow analysis and on the implementations of some comparable dataflow frameworks.

Foundations. When transfer functions are monotone and lattices are finite in height, iterative dataflow analysis converges to a fixed point (Kam and Ullman 1976). If the lattice’s join operation distributes over transfer functions, this fixed point is equivalent to a join-over-all-paths solution to the recursive dataflow equations (Kildall 1973).³ Kam and Ullman (1977) generalize to some monotone functions. Each client of Hoopl must guarantee monotonicity.

³ Kildall uses meets, not joins. Lattice orientation is a matter of convention, and conventions have changed. We use Dana Scott’s orientation, in which higher elements carry more information.

Cousot and Cousot (1977, 1979) introduce abstract interpretation as a technique for developing lattices for program analysis. Steffen (1991) shows that a dataflow analysis can be implemented using model checking; Schmidt (1998) expands on this result by showing that an all-paths dataflow problem can be viewed as model checking an abstract interpretation.

Marlowe and Ryder (1990) present a survey of different methods for performing dataflow analyses, with emphasis on theoretical results. Muchnick (1997) presents many examples of both particular analyses and related algorithms.

Lerner, Grove, and Chambers (2002) show that interleaving analysis and transformation is sound, even when not all speculative transformations are performed on later iterations.

Frameworks. Most dataflow frameworks support only analysis, not transformation. The framework computes a fixed point of transfer functions, and it is up to the client of the framework to use that fixed point for transformation. Omitting transformation makes it much easier to build frameworks, and one can find a spectrum of designs. We describe two representative designs, then move on to frameworks that do interleave analysis and transformation.

The Soot framework is designed for analysis of Java programs (Vallée-Rai et al. 2000). While Soot’s dataflow library supports only analysis, not transformation, we found much to admire in its design. Soot’s library is abstracted over the representation of the control-flow graph and the representation of instructions. Soot’s interface for defining lattice and analysis functions is like our own, although because Soot is implemented in an imperative style, additional functions are needed to copy lattice elements.

The CIL toolkit (Necula et al. 2002) supports both analysis and rewriting of C programs, but rewriting is clearly distinct from analysis: one runs an analysis to completion and then rewrites based on the results. The framework is limited to one representation of control-flow graphs and one representation of instructions, both of which are mandated by the framework. The API is complicated; much of the complexity is needed to enable the client to affect which instructions the analysis iterates over.

The Whirlwind compiler contains the dataflow framework implemented by Lerner, Grove, and Chambers (2002), who were the first to interleave analysis and transformation. Their implementation is much like our early efforts: it is a complicated mix of code that simultaneously manages interleaving, deep rewriting, and fixed-point computation. By separating these tasks, our implementation simplifies the problem dramatically. Whirlwind’s implementation also suffers from the difficulty of maintaining pointer invariants in a mutable representation of control-flow graphs, a problem we have discussed elsewhere (Ramsey and Dias 2005).

Because speculative transformation is difficult in an imperative setting, Whirlwind’s implementation is split into two phases. The first phase runs the interleaved analyses and transformations to compute the final dataflow facts and a representation of the transformations that should be applied to the input graph. The second phase executes the transformations. In Hoopl, because control-flow graphs are immutable, speculative transformations can be applied immediately, and there is no need for a phase distinction.

7. Performance considerations

Our work on Hoopl is too new for us to be able to say much about performance. It is important to know how well Hoopl performs, but the question is comparative, and there isn’t another library we can compare Hoopl with. For example, Hoopl is not a drop-in replacement for an existing component of GHC; we introduced Hoopl to GHC as part of a major refactoring of GHC’s back end. With Hoopl, GHC seems about 15% slower than the previous GHC, but we don’t know what part of the slowdown, if any, should be attributed to the optimizer. We can say that the costs of using Hoopl seem reasonable; there is no “big performance hit.” And a somewhat similar library, written in an impure functional language, actually improved performance in an apples-to-apples comparison with a library using a mutable control-flow graph (Ramsey and Dias 2005).

Although thorough evaluation of Hoopl’s performance must await future work, we can identify some design decisions that might affect performance.

• In Figure 2, we show a single concatenation operator for blocks. Using this representation, a block of N nodes is represented using 2N − 1 heap objects. We have also implemented a representation of blocks that includes “cons-like” and “snoc-like” constructors (sketched after this list); this representation requires only N + 1 heap objects. We don’t know how this choice affects performance.

• In Section 5, the body function analyzes and (speculatively) rewrites the body of a control-flow graph, and fixpoint iterates this analysis until it reaches a fixed point. Decorated graphs computed on earlier iterations are thrown away. For each decorated graph of N nodes, at least 2N − 1 thunks are allocated; they correspond to applications of singletonDG in node and of dgSplice in cat. In an earlier version of Hoopl, this overhead was eliminated by splitting arfGraph into two phases, as in Whirlwind. The single arfGraph is simpler and easier to maintain; we don’t know if the extra thunks matter.

• The representation of a forward-transfer function is private to Hoopl. Two representations are possible: we may store a triple of functions, one for each shape a node may have; or we may store a single, polymorphic function. Hoopl uses triples, because although working with triples makes some code slightly more complex, the costs are straightforward. If we used a single, polymorphic function, we would have to use a shape classifier (supplied by the client) when composing transfer functions. Using a shape classifier would introduce extra case discriminations every time we applied a transfer function or rewrite function to a node. We don’t know how these extra discriminations might affect performance.
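A hedged sketch of such a cons/snoc-style block type; the constructor names are invented, and only the shape discipline (a closed-open head, open-open middles, an open-closed tail) is intended to match the description in the first bullet above:

  data Block' n e x where
    BOne  :: n O O -> Block' n O O                    -- a single middle node
    BCons :: n O O -> Block' n O O -> Block' n O O    -- cons a middle node
    BHead :: n C O -> Block' n O O -> Block' n C O    -- attach the entry node
    BTail :: Block' n O O -> n O C -> Block' n O C    -- attach the exit node
    BFull :: n C O -> Block' n O O -> n O C -> Block' n C C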
In summary, Hoopl performs well enough for use in GHC, but there is much we don’t know. We have no evidence that any of the decisions above measurably affects performance—systematic investigation is indicated.

8. Discussion

We built Hoopl in order to combine three good ideas (interleaved analysis and transformation, an applicative control-flow graph, and optimization fuel) in a way that could easily be reused by many compiler writers. To evaluate how well we succeeded, we examine how Hoopl has been used, we examine the API, and we examine the implementation. We also sketch one of the many alternatives we have implemented.

Using Hoopl. As suggested by the constant-propagation example in Figure 5, Hoopl makes it easy to implement many standard dataflow analyses. Students using Hoopl in a class at Tufts were able to implement such optimizations as lazy code motion (Knoop, Ruething, and Steffen 1992) and induction-variable elimination (Cocke and Kennedy 1977) in just a few weeks. Graduate students at Yale and at Portland State have also implemented a variety of optimizations.

Hoopl’s graphs can support optimizations beyond classic dataflow. For example, in GHC, Hoopl’s graphs are used to implement optimizations based on control flow, such as eliminating branch chains. Hoopl is SSA-neutral: although we know of no attempt to use Hoopl to establish or enforce SSA invariants, Hoopl makes it easy to include φ-functions in the representation of first nodes, and if a transformation preserves SSA invariants, it will continue to do so when implemented in Hoopl.

Examining the API. We hope that our presentation of the API in Section 4 speaks for itself, but there are a couple of properties worth highlighting. First, it’s a good sign that the API provides many higher-order combinators that make it easier to write client code. We have had space to mention only a few: extendJoinDomain, joinMaps, thenFwdRw, iterFwdRw, deepFwdRw, and pairFwd.

Second, the static encoding of open and closed shapes at compile time worked out well. Shapes may seem like a small refinement, but they helped eliminate a number of bugs from GHC, and we expect them to help other clients too. GADTs are a convenient way to express shapes, and for clients written in Haskell, they are clearly appropriate. If one wished to port Hoopl to a language without GADTs, many of the benefits could be realized by making the shapes phantom types, but without GADTs, pattern matching would be significantly more tedious and error-prone.

Examining the implementation. If you are thinking of adopting Hoopl, you should consider not only whether you like the API, but whether, if you had to, you could maintain the implementation. We believe that Section 5 sketches enough to show that Hoopl’s implementation is a clear improvement over previous implementations of similar ideas. By decomposing our implementation into node, block, body, graph, cat, fixpoint, and mkFRewrite, we have cleanly separated multiple concerns: interleaving analysis with rewriting, throttling rewriting using optimization fuel, and computing a fixed point using speculative rewriting. Because of this separation of concerns, we believe our implementation will be easier to maintain than anything that preceded it.

Design alternatives. We have explored many alternatives to the API presented above. While these alternatives are interesting, describing and discussing an interesting alternative seems to take us a half-column or a column of text. Accordingly, we discuss only the single most interesting alternative: keeping the rewrite monad m private instead of allowing the client to define it.

We have implemented an alternative API in which every rewrite function must use a monad mandated by Hoopl. This alternative has advantages: Hoopl implements checkpoint, restart, setFuel, and getFuel, so we can ensure that they are right and that the client cannot misuse them. The downside is that the only actions a rewrite function can take are the actions in the monad(s) mandated by Hoopl. These monads must therefore provide extra actions that a client might need, such as supplying fresh labels for new blocks. Worse, Hoopl can’t possibly anticipate every action a client might want to take. What if a client wanted one set of unique names for labels and a different set for registers? What if, in order to judge the effectiveness of an optimization, a client wanted to log how many rewrites take place, or in what functions they take place? Or what if a client wanted to implement Primitive Redex Speculation (Runciman 2010), a code-improving transformation that can create new function definitions? Hoopl’s predefined monads don’t accommodate any of these actions. By permitting the client to define the monad m, we risk the possibility that the client may implement key operations incorrectly, but we also ensure that Hoopl can support these examples, as well as other examples not yet thought of.

Final remarks. Dataflow optimization is usually described as a way to improve imperative programs by mutating control-flow graphs. Such transformations appear very different from the tree rewriting that functional languages are so well known for and which makes Haskell so attractive for writing other parts of compilers. But even though dataflow optimization looks very different from what we are used to, writing a dataflow optimizer in Haskell was a win: we had to make every input and output explicit, and we had a strong incentive to implement things compositionally. Using Haskell helped us make real improvements in the implementation of some very sophisticated ideas.

Acknowledgments

Brian Huffman and Graham Hutton helped with algebraic laws. Sukyoung Ryu told us about Primitive Redex Speculation. Several anonymous reviewers helped improve the presentation.

The first and second authors were funded by a grant from Intel Corporation and by NSF awards CCF-0838899 and CCF-0311482. These authors also thank Microsoft Research Ltd, UK, for funding extended visits to the third author.

References

Andrew W. Appel. 1998. Modern Compiler Implementation. Cambridge University Press, Cambridge, UK. Available in three editions: C, Java, and ML.

John Cocke and Ken Kennedy. 1977. An algorithm for reduction of operator strength. Communications of the ACM, 20(11):850–856.

Keith D. Cooper, Timothy J. Harvey, and Ken Kennedy. 2001. A simple, fast dominance algorithm. Technical report, Rice University. Unpublished report available from http://www.hipersoft.rice.edu/grads/publications/dom14.pdf.

Patrick Cousot and Radhia Cousot. 1977 (January). Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Conference Record of the 4th ACM Symposium on Principles of Programming Languages, pages 238–252.

Patrick Cousot and Radhia Cousot. 1979 (January). Systematic design of program analysis frameworks. In Conference Record of the 6th Annual ACM Symposium on Principles of Programming Languages, pages 269–282.

John B. Kam and Jeffrey D. Ullman. 1976. Global data flow analysis and iterative algorithms. Journal of the ACM, 23(1):158–171.

John B. Kam and Jeffrey D. Ullman. 1977. Monotone data flow analysis frameworks. Acta Informatica, 7:305–317.

Gary A. Kildall. 1973 (October). A unified approach to global program optimization. In Conference Record of the ACM Symposium on Principles of Programming Languages, pages 194–206.

Jens Knoop, Oliver Ruething, and Bernhard Steffen. 1992. Lazy code motion. Proceedings of the ACM SIGPLAN ’92 Conference on Programming Language Design and Implementation, in SIGPLAN Notices, 27(7):224–234.

Sorin Lerner, David Grove, and Craig Chambers. 2002 (January). Composing dataflow analyses and transformations. Conference Record of the 29th Annual ACM Symposium on Principles of Programming Languages, in SIGPLAN Notices, 31(1):270–282.

Thomas J. Marlowe and Barbara G. Ryder. 1990. Properties of data flow frameworks: a unified model. Acta Informatica, 28(2):121–163.

Steven S. Muchnick. 1997. Advanced compiler design and implementation. Morgan Kaufmann, San Mateo, CA.

George C. Necula, Scott McPeak, Shree Prakash Rahul, and Westley Weimer. 2002. CIL: Intermediate language and tools for analysis and transformation of C programs. In CC ’02: Proceedings of the 11th International Conference on Compiler Construction, pages 213–228.

Norman Ramsey and João Dias. 2005 (September). An applicative control-flow graph based on Huet’s zipper. In ACM SIGPLAN Workshop on ML, pages 101–122.

Colin Runciman. 2010 (June). Finding and increasing PRS candidates. Reduceron Memo 50, www.cs.york.ac.uk/fp/reduceron.

David A. Schmidt. 1998. Data flow analysis is model checking of abstract interpretations. In Conference Record of the 25th Annual ACM Symposium on Principles of Programming Languages, pages 38–48.

Bernhard Steffen. 1991. Data flow analysis as model checking. In TACS ’91: Proceedings of the International Conference on Theoretical Aspects of Computer Software, pages 346–365.

Raja Vallée-Rai, Etienne Gagnon, Laurie J. Hendren, Patrick Lam, Patrice Pominville, and Vijay Sundaresan. 2000. Optimizing Java bytecode using the Soot framework: Is it feasible? In CC ’00: Proceedings of the 9th International Conference on Compiler Construction, pages 18–34.

David B. Whalley. 1994 (September). Automatic isolation of compiler errors. ACM Transactions on Programming Languages and Systems, 16(5):1648–1659.
Supercompilation by Evaluation Maximilian Bolingbroke
Simon Peyton Jones
University of Cambridge [email protected]
Microsoft Research [email protected]
Abstract
This paper shows how call-by-need supercompilation can be recast to be based explicitly on an evaluator, contrasting with standard presentations which are specified as algorithms that mix evaluation rules with reductions that are unique to supercompilation. Building on standard operational-semantics technology for call-by-need languages, we show how to extend the supercompilation algorithm to deal with recursive let expressions.

Categories and Subject Descriptors D.3.1 [Programming Languages]: Formal Definitions and Theory – Semantics; D.3.2 [Programming Languages]: Language Classifications – Applicative (functional) languages; D.3.4 [Programming Languages]: Processors – Optimization

General Terms Algorithms, Performance

1. Overview

Supercompilation is a powerful program transformation technique due to Turchin [1] which can be used to both automatically prove theorems about programs [2] and greatly improve the efficiency with which they execute [3]. Supercompilation is capable of achieving transformations such as deforestation [4], function specialisation and constructor specialisation [5]. Despite its remarkable power, the transformation is simple, principled and fully automatic. Supercompilation is closely related to partial evaluation, but can achieve strictly more optimising transformations [6]. The key contributions of this paper are as follows:

• Inspired by Mitchell’s promising results [7], we cast supercompilation in a new light, showing how to design a modular supercompiler that is based directly on the operational semantics of the language (Section 3). Viewing supercompilation in this way is valuable, because it makes it easier to derive a supercompiler in a systematic way from the language, and to adapt it to new language features. Previous work intermingles evaluation and specialisation in a much more complex and ad-hoc way.

• As an example of this flexibility, we show how to supercompile a call-by-need language with unrestricted recursive let bindings, by making use of a standard evaluator for call-by-need (Section 4). This has two advantages. First, because recursion is not special, we do not need to give the program top-level special status, or λ-lift the input program. Second, our supercompiler can deforest the following term:

    let ones = 1 : ones; map = . . .
    in map (λx . x + 1) ones

  into the direct-style definition:

    let xs = 2 : xs in xs

  With the exception of Klyuchnikov’s HOSC [8], previous supercompilers for lazy languages have dealt only with nonrecursive let bindings. HOSC is also able to deforest this example, but at the cost of sometimes duplicating work – something that we are careful to avoid.

• We perform an empirical evaluation of our supercompiler (Section 5), in particular comparing it to Mitchell’s supercompiler [7]. The source code for the implementation, the Cambridge Haskell Supercompiler (CHSC), is available online¹. Our supercompiler reduces benchmark runtime by up to 95%, with a mean reduction of 26%.

¹ http://github.com/batterseapower/supercompilation-by-evaluation/

2. Supercompilation by example

The best way to understand how supercompilation works is by example. Let’s begin with a simple example of how standard supercompilation can specialise functions to their higher-order arguments:

  let inc = λx . x + 1
      map = λf xs. case xs of [ ]      → [ ]
                              (y : ys) → f y : map f ys
  in map inc zs

A supercompiler evaluates open terms, so that reductions that would otherwise be done at runtime are performed at compile time. Consequently, the first step of the algorithm is to reduce the term as much as possible, following standard evaluation rules:

  let inc = . . . ; map = . . .
  in case zs of [ ]      → [ ]
                (y : ys) → inc y : map inc ys

At this point, we become stuck on the free variable zs. The most important decision when designing a supercompiler is how to proceed in such a situation, and we will spend considerable time later explaining how this choice is made when we cover the splitter in Section 3.5. In this particular example, we continue by recursively supercompiling two subexpressions. We intend to later recombine the two subexpressions into an output term where the case zs remains in the output program, but where both branches of the case have been further optimised by supercompilation.
The first subexpression is just [ ]. Because this is already a value, supercompilation makes no progress: the result of supercompiling that term is therefore [ ]. The second subexpression is:

  let inc = . . . ; map = . . .
  in inc y : map inc ys

Again, evaluation of this term is unable to make progress: the rules of call-by-need reduction do not make allowance for evaluating within non-strict contexts such as the arguments of data constructors. It is once again time to use the splitter to produce some subexpressions suitable for further supercompilation. This time, the first subexpression is:

  let inc = . . . in inc y

Again, we perform reduction, yielding the supercompiled term y + 1. The other subexpression, originating from splitting the (y : ys) case branch, is:

  let inc = . . . ; map = . . . in map inc ys

This term is identical to the one we started with, except that it has the free variable ys rather than zs. If we continued inlining and β-reducing the map call, the supercompiler would not terminate. This is not what we do. Instead, the supercompiler uses a memo function. It records all of the terms it has been asked to supercompile as it proceeds, so that it never supercompiles the same term twice. In concrete terms, it builds up a set of promises, each of which is an association between a term previously submitted for supercompilation, its free variables, and a unique, fresh name (typically written h0, h1, etc.). At this point in the supercompilation of our example, the promises will look something like this:

  h0 zs   ↦ let inc = . . . ; map = . . . in map inc zs
  h1      ↦ [ ]
  h2 y ys ↦ let inc = . . . ; map = . . . in inc y : map inc ys
  h3 y    ↦ let inc = . . . in inc y

We have presented the promises in a rather suggestive manner, as if the promises were a sequence of bindings. Indeed, the intention is that the final output of the supercompilation process will be not only an optimised expression, but one optimised binding for each h0, h1, ... ever added to the promises. Because the term we are now being asked to supercompile is simply a renaming of the original term (with which we associated the name h0) we can immediately return h0 ys as the supercompiled version of the current term. Producing a tieback like this, we can rely on the (not yet known) optimised form of the original term (rather than supercompiling afresh), while simultaneously sidestepping a possible source of non-termination.

Now, both of the recursive supercompilations requested in the process of supercompiling h2 have been completed. We can now rebuild the optimised version of the h2 term from the optimised subterms, which yields:

  h3 y : h0 ys

Continuing this process of rebuilding an optimised version of the supercompiler input from the optimised subexpressions, we eventually obtain this final program:

  let h0 zs = case zs of [ ] → h1 ; (y : ys) → h2 y ys
      h1 = [ ]
      h2 y ys = h3 y : h0 ys
      h3 y = y + 1
  in h0 zs

A trivial post-pass can eliminate some of the unnecessary indirections to obtain a version of the original input expression, where map has been specialised on its functional argument:

  let h0 zs = case zs of [ ] → [ ]; (y : ys) → (y + 1) : h0 ys
  in h0 zs

  Variables          x, y, z
  Primitives         ⊗ ::= +, −, . . .
  Data Constructors  C ::= True, Just, (:), . . .
  Literals           ℓ ::= 1, 2, . . . , ’a’, ’b’, . . .

  Values v ::= λx. e         Lambda abstraction
             | ℓ             Literal
             | C x           Saturated constructed data

  Terms e ::= x                   Variable reference
            | v                   Values
            | e x                 Application
            | e ⊗ e               Binary primops
            | let x = e in e      Recursive let-binding
            | case e of α → e     Case decomposition

  Case Alternative α ::= ℓ       Literal alternative
                       | C x     Constructor alternative

  Heaps         H ::= x ↦ e

  Stack Frames  κ ::= update x          Update frame
                    | • x               Apply to function value
                    | case • of α → e   Scrutinise value
                    | • ⊗ e             Apply first value to primop
                    | v ⊗ •             Apply second value to primop

  Stacks        K ::= κ

  Figure 1: Syntax of the Core language and evaluator

3. The basic supercompiler

We now describe the design of an unusually-modular supercompiler for a simple functional language that closely approximates GHC’s intermediate language, Core. The syntax of the language itself is presented in Figure 1; it is a standard untyped call-by-need calculus with recursive let, algebraic data types, primitive literals and strict primitive operations. Although Figure 1 describes terms in A-normal form [9], for clarity of presentation we will often write non-normalised expressions. A program is simply a term, in which the top-level function definitions appear as possibly-recursive let bindings.

A small-step operational semantics of Core appears in Figure 3, and is completely conventional in the style of Sestoft [10] — so conventional that our description here is very brief indeed. The state of the machine is a triple ⟨H, e, K⟩ of a heap, a term and a stack. The term is the focus of evaluation, while the stack embodies the evaluation context, or continuation, that will consume the value produced by the term. Figure 1 gives the syntax of heaps and stacks, as well as terms.

Our supercompiler is built from the following four, mostly independent, subsystems:
1. A termination criterion that prevents the supercompiler from running forever: Section 3.2

2. An evaluator for the language under consideration: Section 3.3

3. A memoiser, which ensures that we supercompile any term at most once: Section 3.4

4. A splitter that tells us how to proceed when evaluation becomes blocked: Section 3.5

We will show how to implement each of these components in a way that will yield a standard supercompiler, which is nonetheless more powerful than previous work in that it will naturally support recursive let.

3.1 The top-level

A distinctive feature of our supercompiler is that it operates on States rather than Terms; we reflect on why in Section 3.7. A State is a triple of type (Heap, Term, Stack), and it represents precisely the state ⟨H, e, K⟩ of the abstract machine (Figure 3). Notice that Term and State are related: any Term can be converted to its initial State, and any State can be converted back to a Term simply by wrapping the heap and the stack around the term, a function we call rebuild. The signatures of the major functions and data types used by the supercompiler – including State and rebuild – are given for easy reference in Figure 2.

  type Heap = Map Var Term          -- See H in Figure 1

  type Stack = [StackFrame ]        -- See κ in Figure 1
  data StackFrame = . . .

  data Term = . . .                 -- See e in Figure 1

  type State = (Heap, Term, Stack )

  freeVars :: State → [Var ]
  rebuild  :: State → Term

  sc :: History → State → ScpM Term

  -- The evaluator (Section 3.3)
  reduce :: State → State

  -- The splitter (Section 3.5)
  split :: Monad m ⇒ (State → m Term) → State → m Term

  -- Termination checking (Section 3.2)
  type History = [State ]
  emptyHistory = [ ] :: History
  data TermRes = Stop | Continue History
  terminate :: History → State → TermRes

  -- Memoisation and the ScpM monad (Section 3.4)
  memo  :: (State → ScpM Term) → State → ScpM Term
  match :: State → State → Maybe (Var → Var )
  runScpM   :: ScpM Term → Term
  freshName :: ScpM Var
  bind      :: Var → Term → ScpM ()
  promises  :: ScpM [Promise ]
  promise   :: Promise → ScpM ()
  data Promise = P {name :: Var , fvs :: [Var ], meaning :: State }

  Figure 2: Types used in the standard supercompiler

The core of the supercompilation algorithm is sc, whose key property is this: for any history h and state s, (sc h s) returns a term with exactly the same meaning as s, but which is implemented more efficiently.

  sc, sc’ :: History → State → ScpM Term
  sc hist = memo (sc’ hist)
  sc’ hist state = case terminate hist state of
                     Continue hist’ → split (sc hist’) (reduce state)
                     Stop           → split (sc hist) state

As foreshadowed in Section 2, sc is a memoised function: if it is ever asked to supercompile a State that is identical to one we have previously supercompiled (modulo renaming), we want to reuse that previous work. This is achieved by calling memo, which memoises uses of sc by recording information in the ScpM monad. We will describe memoisation in more detail in Section 3.4.
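Just to fix intuitions, here is a hedged sketch of how memo might be built from the Figure 2 primitives; the helpers apps and lambdas (for applying a variable to, and abstracting over, a list of free variables) are invented for illustration, and the paper’s own definition appears in Section 3.4:

  memo :: (State → ScpM Term) → State → ScpM Term
  memo opt state = do
    ps <- promises
    case [ (p, rn) | p <- ps, Just rn <- [match (meaning p) state ] ] of
      (p, rn) : _ →                      -- a renaming of an old term: tie back
        return (name p ‘apps‘ map rn (fvs p))
      [ ] → do                           -- a genuinely new term: promise, then optimise
        h <- freshName
        promise (P {name = h, fvs = freeVars state, meaning = state })
        e <- opt state
        bind h (lambdas (freeVars state) e)
        return (h ‘apps‘ freeVars state)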
-- The evaluator (Section 3.3) reduce :: State → State -- The splitter (Section 3.5) split :: Monad m ⇒ (State → m Term) → State → m Term -- Termination checking (Section 3.2) type History = [State ] emptyHistory = [ ] :: History data TermRes = Stop | Continue History terminate :: History → State → TermRes -- Memoisation and the ScpM monad (Section 3.4) memo :: (State → ScpM Term) → State → ScpM Term match :: State → State → Maybe (Var → Var ) runScpM :: ScpM Term → Term freshName :: ScpM Var bind :: Var → Term → ScpM () promises :: ScpM [Promise ] promise :: Promise → ScpM () data Promise = P {name :: Var , fvs :: [Var ], meaning :: State }
1. It invokes a call-by-need evaluator, reduce, to optimise the state s by evaluating it to head normal form. This amounts to performing compile-time evaluation, so reduce must itself be careful not to diverge – see Section 3.3. 2. It uses split to recursively supercompile some subcomponents of the reduced state, optimising parts of the term that reduction didn’t reach.
Figure 2: Types used in the standard supercompiler
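Although Figure 2 gives only rebuild's signature, its definition is short. The following is a minimal sketch under assumed constructor names (Let, App, Case and Var for Term; Apply, Scrutinise and Update for StackFrame), since the paper elides both data types; primitive-operation frames would be handled analogously:

  import qualified Data.Map as Map

  rebuild :: State -> Term
  rebuild (h, e, k) = wrapHeap h (foldl wrapFrame e k)
    where
      -- wrap the heap around the result as a (possibly recursive) let
      wrapHeap heap body
        | Map.null heap = body
        | otherwise     = Let (Map.toList heap) body
      -- each stack frame reinstates the syntactic context it came from
      wrapFrame e' (Apply x)       = App e' x
      wrapFrame e' (Scrutinise as) = Case e' as
      wrapFrame e' (Update x)      = Let [(x, e')] (Var x)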
The core of the supercompilation algorithm is sc, whose key property is this: for any history h and state s, (sc h s) returns a term with exactly the same meaning as s, but which is implemented more efficiently.

sc, sc′ :: History → State → ScpM Term
sc hist = memo (sc′ hist)
sc′ hist state = case terminate hist state of
    Continue hist′ → split (sc hist′) (reduce state)
    Stop           → split (sc hist) state

As foreshadowed in Section 2, sc is a memoised function: if it is ever asked to supercompile a State that is identical to one we have previously supercompiled (modulo renaming), we want to reuse that previous work. This is achieved by calling memo, which memoises uses of sc by recording information in the ScpM monad. We will describe memoisation in more detail in Section 3.4.

Memoisation deals with the case where sc is called on an identical argument. But what if it is called on a growing argument? You might imagine that we would keep supercompiling forever. This well-known problem arises, for example, when supercompiling a recursive function with an accumulating parameter. There is likewise a well-known way to ensure that supercompilation terminates, which involves maintaining a "history" of previous arguments. In concrete terms, the parameter hist is the history, and sc′ starts by calling terminate (Figure 2) to decide whether to Stop or (the common case) Continue. The implementation of histories and terminate is elaborated in Section 3.6. The normal case is that terminate returns Continue hist′, in which case sc′ proceeds thus:

1. It invokes a call-by-need evaluator, reduce, to optimise the state s by evaluating it to head normal form. This amounts to performing compile-time evaluation, so reduce must itself be careful not to diverge – see Section 3.3.
2. It uses split to recursively supercompile some subcomponents of the reduced state, optimising parts of the term that reduction didn't reach.

Here is an example. Imagine that this term was input to sc²:

let x = True; y = 1 + 2 in case x of True → Just y; False → Nothing

Assuming that this State has never been previously supercompiled, sc′ will be invoked by memo. Further assuming that the termination check in sc′ returns Continue, we would reduce the input state to head normal form, giving a new state′:

let y = 1 + 2 in Just y

The case computation and x binding have been reduced away. It would be possible to return this state′ as the final, supercompiled form of our input – indeed, in general the supercompiler is free to stop at any time, using rebuild to construct a semantically-equivalent result term. However, doing so misses the opportunity to supercompile some subcomponents of state′ that are not reduced in the head normal form. Instead, we feed state′ to split, which:

1. Invokes sc hist′ on the subterm 1 + 2, achieving further supercompilation (and hence optimisation). Let's say for the purposes of the example that this then returns the final optimised term h1, with a corresponding optimised binding h1 = 3 recorded in the monad.

² Technically, sc takes a State not a Term, but for ease of presentation our examples will often use a term e in place of the state (emptyHeap, e, [ ]), as we do here.
2. Reconstructs the term using the optimised subexpressions. So in this case the Term returned by split would be let y = h1 in Just y.

The entry point to the supercompiler, start, is as follows:

start :: Term → Term
start e = runScpM (sc emptyHistory (emptyHeap, e, [ ]))

The input term, e, is first converted into an initial State, namely (emptyHeap, e, [ ]). This initial state is passed to the main supercompiler sc, along with the initial history. Finally sc is performed in the ScpM monad, initialised by runScpM – we describe this monad in detail in Section 3.4.

In the following sections, we will explore the meaning and implementation of the reduce, memo, terminate and split functions in much more detail.

3.2 The termination criterion

The core of the supercompiler's termination check is provided by a single function, terminate:

terminate :: History → State → TermRes
data TermRes = Stop | Continue History

As the supercompiler proceeds, it builds up an ever-larger History of previously-observed States. This history is both interrogated and extended by calling terminate. Termination is guaranteed by making sure that History cannot grow indefinitely. More precisely, terminate guarantees that, for any history h0 and states s0, s1, s2, ... there can be no infinite sequence of calls to terminate of this form:

terminate h0 s0 = Continue h1
terminate h1 s1 = Continue h2
...

Instead, there will always exist some j such that:

terminate hj sj = Stop

In Section 3.3 we will see how reduce uses terminate to ensure that it only performs a bounded number of reduction steps, and we will discuss how terminate ensures that the overall supercompiler terminates in Section 3.6.

So much for the specification, but how can terminate be implemented? Of course, (λx y. Stop) would be a sound implementation of terminate, in that it satisfies the property described above, but it is wildly over-conservative because it forces the supercompiler to stop reduction immediately. We want an implementation of terminate that is correct, but which nonetheless waits for as long as possible before preventing further reduction by answering Stop.

The key to implementing such a termination criterion is defining a well-quasi-order [11, 12]. The relation ◁ ∈ S × S is a well-quasi-order iff for all infinite sequences of elements of S (s0, s1, ...), it holds that: ∃i j. i < j ∧ si ◁ sj. Given any well-quasi-order ◁ : State × State, we can implement a correct terminate function:

terminate prevs here = if any (◁ here) prevs then Stop else Continue (here : prevs)

Concretely, we choose the tag-bag ordering of Mitchell [7] as the basis of our well-quasi-order. The tag-bag order relates bags (multisets) of "tags" as follows:

t1 ◁tb t2 ⟺ set(t1) = set(t2) ∧ |t1| ≤ |t2|

For this to be a well-quasi-order there must be a finite number of distinct tags that can appear in the bags. We take tags to be Ints, and assume that every sub-term of the supercompiler's input program is labelled with a unique Int, which forms the tag for that expression. Likewise, StackFrames are labelled with the tag of the term the evaluator produced them from – e.g. a case • of α → e frame would be labelled with the tag of the corresponding case expression. Occasionally, the evaluator needs to manufacture a new term which did not necessarily occur in the input program – e.g. if we evaluate 1 + 2 to get the new value 3. In such cases, one of the operand tags is used as the tag for the new term.

The termination criterion then defines an internal function that obtains a tag-bag from the components of a State triple:

tagBag :: State → Bag Tag
tagBag (h, e, k) = (termTag e ∗ 2) ‘insertBag‘ (fmap (∗3) (heapTagBag h) ‘plusBag‘ fmap (∗5) (stackTagBag k))

The tagBag function multiplies tags by distinct prime numbers depending on where in the evaluation context the tag originated from. This does not change the fact that there are only ever a finite number of distinct tags in the bags (and hence ◁tb is still a well-quasi-order). However, the multiplication tends to prevent the evaluator from terminating just because e.g. a tagged binding that used to appear in the Heap is forced and hence has its tag show up on a StackFrame instead.

Finally, we can combine tagBag and ◁tb to produce the well-quasi-order ◁ on States used by terminate:

(◁) :: State → State → Bool
s1 ◁ s2 = tagBag s1 ◁tb tagBag s2

Mitchell uses tag-bags in a similar way, but only associates tags with let-bound variables. In order to tag every subexpression, he keeps terms in a normal form where all subexpressions are let-bound. Supercompiling States and tagging subterms directly means that we can avoid let-floating and – because we distinguish between tags from subexpressions currently being evaluated (in the stack), and those subexpressions that are not in the process of being forced (in the heap) – our termination criterion is more lenient.
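To make the tag-bag order itself concrete, the following sketch models a bag as a map from elements to multiplicities; the paper never fixes a Bag representation, so this encoding is an assumption:

  import qualified Data.Map.Strict as Map

  type Tag = Int
  type Bag a = Map.Map a Int   -- multiset: element mapped to its multiplicity

  bagSize :: Bag a -> Int
  bagSize = sum . Map.elems

  -- The tag-bag order: same underlying set of tags, and the
  -- first bag is no larger than the second.
  leqTB :: Bag Tag -> Bag Tag -> Bool
  leqTB t1 t2 = Map.keysSet t1 == Map.keysSet t2
             && bagSize t1 <= bagSize t2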
3.3 The evaluator

The reduce function tries to reduce a State to head normal form. In case the term diverges, reduce includes a termination check that allows it to stop after a finite number of steps. (This check is conservative, of course, so reduce might fail to find a head normal form when one does exist.) The two key properties of reduce are:

• Reduction preserves meaning: the State returned has the same semantics as the input State
• Regardless of what meaning the input State may have, reduce always terminates

The implementation is straightforward:

reduce :: State → State
reduce = go emptyHistory
  where
    go hist state = case step state of
        Nothing → state
        Just state′
          | intermediate state′ → go hist state′
          | otherwise → case terminate hist state′ of
              Stop           → state′
              Continue hist′ → go hist′ state′

    intermediate (_, Var _, _) = False
    intermediate _             = True

    step :: State → Maybe State  -- Implements Figure 3
VAR         ⟨H, x ↦ e | x | K⟩                                → ⟨H | e | update x, K⟩
UPDATE      ⟨H | v | update x, K⟩                             → ⟨H, x ↦ v | v | K⟩
APP         ⟨H | e x | K⟩                                     → ⟨H | e | • x, K⟩
LAMBDA      ⟨H | λx. e | • x, K⟩                              → ⟨H | e | K⟩
PRIM        ⟨H | e1 ⊗ e2 | K⟩                                 → ⟨H | e1 | • ⊗ e2, K⟩
PRIM-LEFT   ⟨H | v1 | • ⊗ e2, K⟩                              → ⟨H | e2 | v1 ⊗ •, K⟩
PRIM-RIGHT  ⟨H | v2 | v1 ⊗ •, K⟩                              → ⟨H | ⊗(v1, v2) | K⟩
CASE        ⟨H | case escrut of α → e | K⟩                    → ⟨H | escrut | case • of α → e, K⟩
DATA        ⟨H | C x | case • of {..., C x → e, ...}, K⟩      → ⟨H | e | K⟩
LIT         ⟨H | ℓ | case • of {..., ℓ → e, ...}, K⟩          → ⟨H | e | K⟩
LETREC      ⟨H | let x = e in ebody | K⟩                      → ⟨H, x ↦ e | ebody | K⟩

Figure 3: Operational semantics of the Core language
The reduce function uses a loop, the function go, with an accumulating history. In turn go uses an internal function, step, which implements precisely the one-step reduction relation of Figure 3. Note that step returns a Maybe State – this accounts for reduction being unable to proceed, due either to reaching a value, or to a variable being in the focus which is not bound by the heap (remember that reduce may be used on open terms). In that case reduce terminates with the state it has reached. The totality of reduce is achieved using the terminate function. If terminate reports that evaluation appears to be diverging, reduce immediately returns. As a result, the State triple (h, e, k) returned by reduce might not be fully reduced – in particular, it might be the case that e ≡ Var x where x is bound by h.

As an optimisation, the termination criterion is not tested if the State is considered to be "intermediate". The intermediate predicate shown ensures that we only test for non-termination upon reaching a variable. This is safe because every infinite series of reduction steps must certainly have a variable occur in the focus an infinite number of times (it is straightforward to construct a measure on (e, K) pairs that is strictly decreased by every reduction rule except VAR).

After some experience with our supercompiler we discovered that making termination tests infrequent is actually more than a mere optimisation. If we test for termination very frequently (say, after every tiny step), the successive states will be very similar; and the more similar they are, the greater the danger that the necessarily-conservative termination criterion (Section 3.2) will unnecessarily say Stop. (For example, in the limit, it must say Stop for two identical states.)
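For reference, the first few rules of Figure 3 transcribe almost mechanically into step. The fragment below is only a sketch under assumed names: the concrete Term and StackFrame constructors (Var, App, Apply, Update) and the isValue test are our inventions, since the paper elides those definitions:

  step :: State -> Maybe State
  step (h, Var x, k) = case Map.lookup x h of
      Nothing -> Nothing                                  -- free variable: stuck
      Just e  -> Just (Map.delete x h, e, Update x : k)   -- VAR
  step (h, v, Update x : k)
      | isValue v = Just (Map.insert x v h, v, k)         -- UPDATE
  step (h, App e x, k) = Just (h, e, Apply x : k)         -- APP
  -- ... the remaining rules of Figure 3 follow the same pattern ...
  step _ = Nothing                                        -- no rule applies: stuck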
3.4 The memoiser

The purpose of the memoisation function, memo, is to ensure that we never supercompile a term more than once. We achieve this by using the ScpM monad to record information about previously supercompiled States. Precisely, the ScpM monad is a simple state monad with three pieces of state:

1. The promises, which comprise all the States that have been previously submitted for supercompilation, along with:
   • The names that the supercompiled versions of those States will be bound to in the final program (e.g. h0, h1)
   • The list of free variables that those bindings will be abstracted over³. By instantiating these free variables several different ways, we can reuse the supercompiled version of a State several times.
   The data structure used to store all this information is called a Promise (Figure 2).
2. The optimised bindings, each of the form x = e. The runScpM function, which is used to actually execute ScpM Term computations, wraps the optimised bindings collected during the supercompilation process around the final supercompiled Term in order to produce the final output.
3. A supply of fresh names (h0, h1, ...) to use for the optimised bindings.

(³ Strictly speaking, bindings with no free variables at all should nonetheless be λ-abstracted over a dummy argument (such as ()). This will prevent us from accidentally introducing space leaks by increasing the garbage-collection lifetime of constant expressions.)

When sc begins to supercompile a State, it records a promise for that state; when it finishes supercompiling that state it records a corresponding optimised binding for it. At any moment there may be unfulfilled promises that lack a corresponding binding, but every binding has a corresponding promise. Moreover, every promise will eventually be fulfilled by an entry appearing in the optimised bindings. Figure 2 summarises the signatures of the functions provided by ScpM.

We can now implement memo as follows:

memo :: (State → ScpM Term) → State → ScpM Term
memo opt state = do
    ps ← promises
    let ress = [ name p ‘apps‘ map rn (fvs p)
               | p ← ps
               , Just rn ← [match (meaning p) state ] ]
    case ress of
      res : _ → return res
      [ ]     → do
        x ← freshName
        let vs = freeVars state
        promise P {name = x, fvs = vs, meaning = state }
        e ← opt state
        bind x (lambdas vs e)
        return (x ‘apps‘ vs)
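Figure 2 fixes only ScpM's interface, not its representation. One plausible realisation is an ordinary state monad over the three pieces of state just described; the following sketch assumes the Promise and Term types of Figures 1 and 2 and takes Var to be String:

  import Control.Monad.State

  type Var = String

  data ScpState = ScpState
    { scpPromises :: [Promise]       -- states submitted so far, newest first
    , scpBindings :: [(Var, Term)]   -- finished optimised bindings
    , scpFresh    :: Int             -- index of the next fresh name
    }

  type ScpM = State ScpState

  freshName :: ScpM Var
  freshName = do
    st <- get
    put st { scpFresh = scpFresh st + 1 }
    return ("h" ++ show (scpFresh st))

  promise :: Promise -> ScpM ()
  promise p = modify (\st -> st { scpPromises = p : scpPromises st })

  bind :: Var -> Term -> ScpM ()
  bind x e = modify (\st -> st { scpBindings = (x, e) : scpBindings st })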
The memo function proceeds as follows:

1. Firstly, it examines all existing promises. If the match function reports that some existing promise matches the State we want to supercompile (up to renaming), memo returns a call to the optimised binding corresponding to that existing promise.
2. Assuming no promise matches, memo continues:
   (a) A new promise for this novel State is made, in the form of a new Promise entry. A fresh name of the form hn (for some n) is associated with the Promise.
   (b) The state is optimised by calling opt, obtaining an optimised term e.
   (c) A final optimised binding hn = λfvs(s). e is recorded using bind. This binding will be placed in the output program by runScpM.
   (d) Finally, a call to that binding, hn fvs(s), is returned.

The match function is used to compare States:

match :: State → State → Maybe (Var → Var )

The key properties of the match function are that:

• If match s1 s2 ≡ Just rn then the meaning of s2 is the same as that of rn(s1).
• If s1 is syntactically identical to s2, modulo renaming, then isJust (match s1 s2). This property is necessary for termination of the supercompiler, as we will discuss later.

Naturally, it is desirable for the match function to match as many truly equivalent terms as possible. This is made slightly more convenient by the fact that we consider matching States, as they may have already been weakly normalised by the evaluator. Our implementation exploits this by providing a match function that is insensitive to the exact order of bindings in the Heap.

One subtle point is that the matching should be careful not to duplicate work. This can happen if an old term such as:

let x = fact 100; y = fact 100 in (x, y)

is matched against a proposed new one such as:

let x = fact 100 in (x, x)

Tying the new term back to the old binding would make it compute fact 100 twice, so the match must be rejected. However, if the let-bindings in those terms had bound, say, True instead of fact 100 then matching them would be both permissible and desirable.

Unlike some supercompilers (e.g. [8]), our use of the memo function means that we will even share the work of supercompiling nodes that are siblings in the supercompilation "process tree".

3.5 The splitter

The job of the splitter is to somehow continue the process of supercompiling a State which we may not reduce further, either because of a lack of information (e.g. if the State is blocked on a free variable), or because the termination criterion is preventing us from making any further one-step reductions. The splitter has the following type signature:

split :: Monad m ⇒ (State → m Term) → State → m Term

In general, (split opt s) identifies some sub-components of the state s, uses opt to optimise them, and combines the results into a term whose meaning is the same as s (assuming, of course, that opt preserves meaning). A sound, but feeble, implementation of split opt s would be one which never recursively invokes opt:

split _ s = return (rebuild s)

Such an implementation is wildly conservative, because not even trivially reducible subexpressions will benefit from supercompilation. A good split function will residualise as little of the input as possible, using opt to optimise as much as possible. It turns out that, starting from this sound-but-feeble baseline, there is a rich variety of choices one can make for split, as we explore in the rest of this section.

In preparation for describing split in more detail, we first introduce a notational device similar to that of Mitchell [7] for describing the operation of split on particular examples. Suppose that the following State is given to split:

⟨x ↦ 1, xs ↦ map (const 1) ys | x : xs⟩

In our notation the output of split would be this "term", which has sub-components that are States:

let x = ⟨1⟩; xs = ⟨map (const 1) ys⟩ in x : xs

You should read this in the following way:

• The part of the term outside the ⟨state brackets⟩ is the residual code that will form part of the output program.
• In contrast, those things that live within the brackets are the not-yet-residual States which are fed to opt for further supercompilation.

Before split returns, the supercompiled form of the bracketed expressions is pasted into the correct position in the residual code. So the actual end result of such a supercompilation run might be something like:

let x = h2; xs = h3 ys in x : xs

where h2 and h3 will have optimised bindings in the output program, as usual.

So far, we have only seen examples where split opt invokes opt on subterms of the original input. While this is a good approximation to what split does, in general, we will also want to include some of the context in which that subterm lives. Consider the following input:

⟨x ↦ 1, y ↦ x + x | Just y⟩

A good way to split is as follows:

let y = ⟨x ↦ 1 | x + x⟩ in Just y

Note that split opt decided to recursively optimise the term x + x, along with a heap binding for x taken from the context which the subterm lived in. This extra context will allow the supercompiler to reduce x + x to 2 at compile time.

Another way that a subterm can get some context added to it by split is when evaluation of a case expression gets stuck. As an example, consider the following (stuck) input to split:

⟨ | x | case • of (True → 1; False → 2), • + 3⟩

One possibility is that split could break the expression up for further supercompilation as follows:

(case x of True → ⟨1⟩; False → ⟨2⟩) + ⟨3⟩

However, split can achieve rather more potential for reduction if it duplicates the stack frame performing addition into both case branches: in particular, that will mean that we are able to evaluate the addition at compile time:

case x of True → ⟨1 | • + 3⟩; False → ⟨2 | • + 3⟩
In fact, in general we will always want to push all of the stack frames following a case • of α → e frame to meet with the expressions e in the case branches. This is one of the places where the decision to have the supercompiler work with States rather than Terms pays off: the fact that we have an explicit evaluation context makes the process of splitting at a residual case very systematic and easy to implement.

The key property of split is that for any opt that is meaning preserving (such that opt s returns an expression e with the same meaning as s), split opt must be meaning preserving in the same sense. There are a number of subtle points to bear in mind when implementing split. We describe some issues below, and will have more to say in Section 4.

Issue 1: learning from residual case branches. We gain information about a free variable when it is scrutinised by a residual case. Thus, given this State:

⟨ | x | case • of (3 → x + x; 4 → x ∗ x)⟩

We split as follows:

case x of 3 → ⟨x ↦ 3 | x + x⟩; 4 → ⟨x ↦ 4 | x ∗ x⟩

Because we have learnt the value of x from the case alternative, we are able to statically reduce the + and ∗ operations in each branch.

Issue 2: work duplication. Consider splitting the following State, where fact is an unknown function and hence must be assumed to be expensive to execute:

⟨x ↦ fact n | (x + 1, x + 2)⟩

One possibility is to split as follows:

(⟨x ↦ fact n | x + 1⟩, ⟨x ↦ fact n | x + 2⟩)

Unfortunately, this choice leads to duplication of the expensive fact n subterm. If we freely duplicate unbounded amounts of work in this manner we can easily end up "optimising" the program into a much less efficient version. Work can be duplicated even if no syntactic duplication occurs, as in this example:

⟨x ↦ fact n | λy. x + y⟩

We would duplicate work if we were to split in the following way:

λy → ⟨x ↦ fact n | x + y⟩

Furthermore, syntactic duplication does not necessarily lead to work duplication. Consider:

⟨x ↦ fact n | y | case • of (True → x + 1; False → x + 2)⟩

Notice that splitting it as follows does not duplicate the computation of fact n:

case y of True → ⟨x ↦ fact n | x + 1⟩; False → ⟨x ↦ fact n | x + 2⟩

Consequently, we push the heap bindings supplied to split down into those split-out subterms of which they are free variables, as long as either one of these conditions is met:

• The binding manifestly binds a value, such as λx. x: values require no further reduction, so no work can be lost that way
• Pushing the binding down into the subterm would not result in the allocation of its thunk occurring more than once in any possible context consuming the output

Our split uses let-floating to make more heap bindings suitable for pushing down under these criteria. For example, this state:

⟨x ↦ Just (fact n) | λm. case x of Just y → y + m⟩

will be split as follows:

let a = ⟨fact n⟩ in λm. ⟨x ↦ Just a | case x of Just y → y + m⟩

Sketching split. Due to space limitations, we are unable to give a complete description of split. However, we can give a sketch of a suboptimal implementation that may nonetheless clarify our description. We first introduce the concept of a Bracket. This is a Haskell representation of the "term with holes" notational device we introduced earlier. Each hole contains a State:

data Bracket = B {holes :: [State ], build :: [Term ] → Term }

termBracket :: Term → Bracket
termBracket e = B [(emptyHeap, e, noStack)] (λ[e′ ] → e′)

Our code examples will often make use of a [[bracketed]] syntax to concisely define a value of type Bracket:

[[f ⟨1⟩]] :: Bracket

This particular example corresponds to:

B {holes = [(emptyHeap, 1, noStack)], build = λ[e′ ] → var "f" ‘apps‘ e′ }

Split can now be defined as follows:

split opt (h, e, k) = liftM (build br) $ mapM opt (holes br)
  where
    xs = case e of Var x → [x ]; _ → [ ]
    br = splitHeap h $ splitStack xs k $ splitTerm e

Each part of the State is split independently to produce a Bracket, which then has all of its holes optimised before we rebuild the final term. Before we cover splitTerm, splitStack and splitHeap, we will need a way to build a larger bracket from smaller ones:

plusBrackets :: [Bracket ] → ([Term ] → Term) → Bracket
plusBrackets brs rb = B {holes = concatMap holes brs, build = f }
  where
    f es = rb (zipWith (λbr es′ → build br es′) brs ess)
      where ess = splitInto (map holes brs) es

splitInto :: [[b ]] → [a ] → [[a ]]
-- splitInto bss as ≡ ass ∧ length (concat bss) ≡ length as
--   =⇒ map length bss ≡ map length ass ∧ as ≡ concat ass
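splitInto is left unimplemented in the paper, with only the specification above; one implementation consistent with that specification is:

  -- Partition as into consecutive chunks whose lengths mirror those of bss.
  splitInto :: [[b]] -> [a] -> [[a]]
  splitInto []         _  = []
  splitInto (bs : bss) as = chunk : splitInto bss rest
    where (chunk, rest) = splitAt (length bs) as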
Now, splitTerm just identifies some subexpressions for supercompilation:

splitTerm :: Term → Bracket
splitTerm e = plusBrackets (map termBracket es) rb
  where (es, rb) = uniplate e

We make use of the uniplate combinator (following Mitchell and Runciman [13]), which takes a Term apart into a list of its immediate subterms, and a function to recombine those subterms to obtain the original input:

uniplate :: Term → ([Term ], [Term ] → Term)
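To illustrate the combinator's contract, here is what uniplate might look like for a cut-down Term type; the constructors used (App, PrimOp, Lam) are assumptions standing in for the elided Term definition:

  uniplate :: Term -> ([Term], [Term] -> Term)
  uniplate (App e x)     = ([e], \[e'] -> App e' x)
  uniplate (PrimOp p es) = (es,  \es'  -> PrimOp p es')
  uniplate (Lam x e)     = ([e], \[e'] -> Lam x e')
  uniplate e             = ([],  \[]   -> e)   -- leaves: variables, literals, ...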
There is more work to do when splitting the stack:

splitStack :: [Var ] → Stack → Bracket → ([(Var , Bracket)], Bracket)
The call splitStack xs k b splits stack k with bracket b in the focus, where all of the variables xs are guaranteed to have the same value as the focus. We will use the xs in splitStack to learn from residual case branches. There are three principal possibilities that splitStack has to deal with. Firstly, applications and primitives can be handled uniformly:

splitStack xs (• x : k) br = splitStack [ ] k [[⟨br⟩ x ]]
splitStack xs (• ⊗ e : k) br = splitStack [ ] k [[⟨br⟩ ⊗ ⟨e⟩]]
splitStack xs (v ⊗ • : k) br = splitStack [ ] k [[⟨v⟩ ⊗ ⟨br⟩]]

The next possibility is that the stack frame arises from a case:

splitStack xs (case • of α → e : k) br = ([ ], [[case ⟨br⟩ of α → ⟨altbr⟩]])
  where
    altbr = ⟨altHeap α | e | k⟩
    altHeap α = fromList [(x, altConValue α) | x ← xs ]

altConValue :: AltCon → Value
altConValue (C x) = C x
altConValue ℓ     = ℓ

Notice that we do not recursively call splitStack in this situation: as we discussed, the entire stack is pushed into each case branch. We also use altHeap to construct a heap that binds the variables being scrutinised (if any) to the value corresponding to the particular case alternative.

Finally, the immediate stack frame may be an update frame:

splitStack xs (update x : k) br = ((x, br) : xbrs′, br′)
  where (xbrs′, br′) = splitStack (x : xs) k [[x ]]

In this case, we recursively split the remainder of the stack, but change the focus to be the variable being updated. The presence of update frames is why splitStack returns a [(Var, Bracket)] as well as a Bracket – the list of (Var, Bracket) contains a Bracket for every update frame that splitStack encountered. As we will see shortly, the brackets from this list will be placed in an enclosing let expression along with those arising from the Heap.

Finally, we can implement splitHeap:

splitHeap :: Heap → ([(Var , Bracket)], Bracket) → Bracket
splitHeap h (xbrs, br) = plusBrackets (map inline (br : brs))
                                      (λ(e : es) → letRec (xs ‘zip‘ es) e)
  where (xs, brs) = unzip (xbrs ++ [(x, termBracket e) | (x, e) ← toList h ])

This completes the implementation of split. A real implementation will need to add several complications:

• The splitHeap function should attempt to push some elements of the Heap into the holes of the brackets from splitStack. A linearity analysis will be required in order to avoid duplicating work when non-value heap bindings get pushed down.
• The Heap should be let-floated to expose values under lets, and hence allow more bindings to be propagated downwards.
• In the presence of recursive let it is not always valid for splitStack to push down the entire stack into the branches of a residual case. This issue is discussed in more detail in Section 4.

3.6 Termination of the supercompiler

Although we have been careful to ensure that our evaluation function, reduce, is total, it is not so obvious that sc itself is terminating. Since split may recursively invoke sc via its higher order argument, we might get an infinitely deep stack of calls to sc! To rule out this possibility, sc carries a history, which – as we saw in Section 3 – is checked before any reduction is performed. If terminate allows the history to be extended, the input State is reduced before recursing. Otherwise, the input State is fed to split unchanged.

In order to be able to prove that the supercompiler terminates, we need some condition on exactly what sort of subcomponents split opt invokes opt on. It turns out that the presence of recursive let requires us to choose a rather complicated condition here, as we will explain further in Section 4.4. Let us pretend for a moment that we have no recursive let. In this scenario, it is always the case for our split that split opt s invokes opt s′ only if s′ ≺ s. The ≺ relation is a well-founded relation defined by s′ ≺ s ⟺ size(s′) < size(s), where size : State → ℕ returns the number of abstract syntax tree nodes in the State.
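The size measure itself can be made concrete with a tiny sketch; the exact counting conventions below (one node per heap binding and stack frame, plus an assumed sizeTerm on Terms) are our choice rather than the paper's:

  size :: State -> Int
  size (h, e, k) = sum [1 + sizeTerm e' | e' <- Map.elems h]  -- heap bindings
                 + sizeTerm e                                 -- the focus
                 + length k                                   -- stack frames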
This is sufficient to ensure termination, as the following argument shows:

Theorem: sc always recurses a finite number of times. Proceed by contradiction. If sc recursed an infinite number of times, then by definition the call stack would contain infinitely many activations of sc hist s for (possibly repeating) sequences of hist and s values. Denote the infinite chains formed by those values as ⟨hist0, hist1, ...⟩ and ⟨s0, s1, ...⟩ respectively.

Now, observe that there must be infinitely many i such that isContinue (terminate histi si). This follows because the only other possibility is that there must exist some j such that ∀l. l ≥ j ⟹ isStop (terminate histl sl). On such a suffix, sc is recursing through split without any intervening uses of reduce. However, by the property we required split to have, such a sequence of states must have a strictly decreasing size:

∀l. l > j ⟹ size(sl) < size(sj)

However, < is a well-founded relation, so such a chain cannot be infinite. This contradicts our assumption that this suffix of sc calls is infinite, so it must be the case that there are infinitely many i such that isContinue (terminate histi si).

Now, form the infinite chain ⟨t1, t2, ...⟩ consisting of the si such that isContinue (terminate histi si). By the properties of terminate, it follows that:

∀i j. j < i ⟹ ¬(tagBag tj ◁tb tagBag ti)

However, this contradicts the fact that ◁ is a well-quasi-order. Combined with the requirement that split opt only calls opt finitely many times, the whole supercompilation process must terminate.

Two non-termination checks. It is important to note that the history carried by sc is extended entirely independently from the history produced by the reduce function (similar to "transient reductions" [14]). The two histories deal with different sources of non-termination. The history carried by reduce prevents non-termination due to divergent expressions, such as this one:

let f x = 1 + (f x) in f 10

In contrast, the history carried by sc prevents non-termination that can arise from repeatedly invoking the split function – even if every subexpression would, considered in isolation, terminate. This is illustrated in the following program:
let count n = n : count (n + 1) in count 0

Left unchecked, we would repeatedly reduce the calls to count, yielding a value (a cons-cell) each time. The split function would then pick out both the head and tail of the cons cell to be recursively supercompiled, leading to yet another unfolding of count, and so on. The resulting (infinite) residual program would look something like:

let h0 = h1 : h2; h1 = 0
    h2 = h3 : h4; h3 = 1
    h4 = h5 : h6; h5 = 2
    ...

The check with terminate before reduction ensures that instead, one of the applications of count is left unreduced. This use of terminate ensures that our program remains finite:

let h0 = h1 : h2; h1 = 0
    h2 = let count = λn. h3 n in count 1
    h3 n = n : h3 (n + 1)
in h0

Negative recursion in data constructors. As a nice aside, the rigorous termination criterion gives us a stronger termination guarantee than the Glasgow Haskell Compiler (GHC) [15], the leading Haskell implementation. Because GHC does not check for recursion through negative positions in data constructors, the following notorious program will force GHC into an infinite loop:

data U = MkU (U → Bool )
russel u@(MkU p) = not (p u)
x = russel (MkU russel ) :: Bool

3.7 Observations on the supercompiler

It is a unique feature of our supercompiler that all our ingredients operate on States, rather than Terms. This is a consequence of explicitly basing the supercompiler on an evaluator, but it pays off in the splitter (Section 3.5) as well. The splitter operates distinctively differently on each of the three components of the State, and takes advantage of the explicit representation of the Stack to push the continuation into the branches of a residual case expression. To split a Term well would be much more inconvenient.

4. Extending to recursive let

In the previous section, we described all the pieces necessary to implement a complete supercompiler. The handling of recursive let is mostly straightforward in this framework, with the exception of two things:

• Update frames originating from recursive let complicate the splitter: Section 4.3
• The termination proof for the supercompiler becomes more complicated: Section 4.4

We cover each of these points in order.

4.1 Update frames

The evaluator (Figure 3 and Section 3.3) deals with a call-by-need language, using update frames in the conventional way to model laziness [10]. When a heap binding x ↦ e is demanded by a variable x coming into the focus of the evaluator, e may not yet be a value. To ensure that we only reduce any given heap-bound e to a value at most once, the evaluator pushes an update frame update x on the stack, before beginning the evaluation of e. After e has been reduced to a value, v, the update frame will be popped from the stack, which is the cue for the evaluator to update the heap with a binding x ↦ v, replacing the old one. Now, subsequent uses of x in the course of evaluation will be able to reuse that value directly, without reducing e again. As an example of how update frames work, consider this reduction sequence:

⟨x ↦ 1 + 2 | x + x | ε⟩
⟨x ↦ 1 + 2 | x | • + x⟩
⟨ | 1 + 2 | update x, • + x⟩
...
⟨ | 3 | update x, • + x⟩
⟨x ↦ 3 | 3 | • + x⟩
⟨x ↦ 3 | x | 3 + •⟩
⟨ | 3 | update x, 3 + •⟩
⟨x ↦ 3 | 3 | 3 + •⟩
⟨x ↦ 3 | 6 | ε⟩

Because the corresponding heap binding is removed from the heap whenever an update frame is pushed, the update frame mechanism is what causes reduction to become blocked if you evaluate a term which forms a black hole:

⟨x ↦ x + 1 | x | ε⟩
...
⟨ | x | • + 1, update x⟩ ↛

Update frames complicate the supercompiler slightly, but in a localised way – we must think carefully as to how the split function should deal with update frames.

4.2 Splitting in the presence of update frames

Just like all other kinds of stack frame, we want to push update frames into residual case branches. Consider this input to split:

⟨ | x | case • of T → F, update y, case • of F → (2, y)⟩

We will split as follows, pushing the whole stack, including the update frame for y, into the case branch:

case x of T → ⟨F | update y, case • of F → (2, y)⟩

After supercompilation is complete, we will then obtain an output term something like the following:

case x of T → let y = F in (2, y)

This is what the splitStack function we saw in Section 3.5 does.

4.3 Splitting update frames from recursive lets

The key problem that the splitter must face is that update frames derived from recursive let can interact badly with our intention to push the entire enclosing stack into the branches of a case. Consider this input to split:

⟨ | unk | • + y, case • of 1 → 2, update y, • + 2⟩

Following our earlier discussion of case, we might be tempted to split as follows:

case unk + y of 1 → ⟨2 | update y, • + 2⟩

However, this is a disastrous choice – due to the occurrence of y in the scrutinee, y is now a free variable of the output expression! The lesson here is that update frames should not be pushed inside case branches if they bind a variable that we may need to refer to outside the case. Following this rule, our example is instead split as follows:

let y = case unk + y of 1 → ⟨2⟩ in y + ⟨2⟩

Irritatingly, the choice about which update frames should not be pushed inside case branches is not as straightforward as a simple free-variable check. The reason is that choosing to not push an update frame down may make more of the variables bound by other pushable update frames free, and hence require us to prevent pushing in yet more update frames! Here is a contrived example illustrating the point – note that for clarity we will not write the update frames directly, and represent the States as if they were terms:
let w = fact z; y = unk + x
    x = case y of 10 → w + 3
    z = case x of 20 → a + 3
in z + w + a

Our initial guess at the output of split may be as follows:

let y = unk + ⟨x⟩
in case y of 10 → ⟨ let w = fact z; x = w + 3
                        z = case x of 20 → a + 3
                    in z + w + a ⟩

Unfortunately, x is now a free variable of the whole expression, and consequently we should not have pushed the update frame for x within the case branch. Based on this information, our next guess may be:

let w = ⟨fact z⟩; y = unk + ⟨x⟩
    x = case y of 10 → ⟨w + 3⟩
in case x of 20 → ⟨ let z = a + 3 in z + w + a ⟩

Note that we have now been forced not to push the w binding down into either case branch, because doing so would risk work duplication. Unfortunately, that has caused z to be free in the output expression! The correct solution is in fact to not push down the update frames for both x and z:

let w = ⟨fact z⟩; y = unk + ⟨x⟩
    x = case y of 10 → ⟨w + 3⟩
    z = case x of 20 → ⟨a + 3⟩
in z + ⟨w⟩ + ⟨a⟩

Our real split implementation uses a fixed point that follows essentially this reasoning process to determine the set of update frames which may not be pushed down.

4.4 Termination in the presence of recursive let

In Section 3.6 we showed why the supercompiler without recursive let terminated. However, to make that argument we had to rely on a condition on split that is simply too restrictive for the supercompiler with recursive let. Before, we used the property that split opt s invoked opt s′ only if s′ ≺ s ⟺ size(s′) < size(s). However, consider this input to split:

⟨f ↦ λy. Just (f (not y)) | Just (f (not y))⟩

We would like to split as follows:

let f = λx. ⟨f ↦ λy. Just (f (not y)) | Just (f (not y))⟩
in Just (f (not y))

This is disallowed by the size-based criterion because the recursively-optimised State would be no smaller than the input. In the presence of recursive let, we can instead use the property that for our split, split opt (h, e, k) only invokes opt on states (h′, e′, k′) that satisfy all of these conditions:

1. h′ ⊆ h ∪ alt-heap(e, k)
2. k′ ‘isInfixOf‘ k
3. e′ ∈ subterms(h, e, k)

The subterms(h, e, k) function returns all expressions that occur syntactically within any of the Heap, Stack or Term inputs. The alt-heap(e, k) function takes the variables bound by update frames in k and, if e ≡ Var x, the variable x. It then forms the cross product of that set with the values corresponding to the α in any case • of α → e ∈ k.
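Of these functions, subterms at least is easy to picture. A sketch, reusing the uniplate combinator of Section 3.5 together with an assumed stackTerms helper that extracts the terms embedded in stack frames:

  subterms :: State -> [Term]
  subterms (h, e, k) = concatMap go (Map.elems h ++ [e] ++ stackTerms k)
    where
      -- every term is a subterm of itself, plus the subterms of its children
      go t = t : concatMap go (fst (uniplate t))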
We are now in a position to repair the proof.

Theorem: sc always recurses a finite number of times. Proceed by contradiction. If sc recursed an infinite number of times, then by definition the call stack would contain infinitely many activations of sc hist s for (possibly repeating) sequences of hist and s values. Denote the infinite chains formed by those values as ⟨hist0, hist1, ...⟩ and ⟨s0, s1, ...⟩ respectively.

Now, observe that there must be infinitely many i such that isContinue (terminate histi si). This follows because the only other possibility is that there must exist some j such that ∀l. l ≥ j ⟹ isStop (terminate histl sl). On such a suffix, sc is recursing through split without any intervening uses of reduce. By the modified property of split and the properties of alt-heap and subterms we have that:

∀l. l ≥ j ⟹ hl ⊆ hj ∪ alt-heap(ej, kj) ∧ kl ‘isInfixOf‘ kj ∧ el ∈ subterms(sj)

We can therefore conclude that the infinite suffix must repeat itself at some point: ∃l. l > j ∧ sl ≡ sj. However, we required that match always succeeds when matching two terms equivalent up to renaming, which means that sc histl sl would have been tied back by memo rather than recursing. This contradicts our assumption that this suffix of sc calls is infinite, so it must be the case that there are infinitely many i such that isContinue (terminate histi si).

Now, form the infinite chain ⟨t1, t2, ...⟩ consisting of the si such that isContinue (terminate histi si). As in Section 3.6, this contradicts the fact that ◁ is a well-quasi-order.

Although the termination argument becomes more complex, the actual supercompilation algorithm remains as simple as ever.

5. Results

We have implemented the supercompiler for a subset of Haskell. It is implemented as a preprocessor: programs are run through the supercompiler before being compiled by GHC at the -O2 optimisation level. The preliminary results of running the supercompiler on a standard array of benchmark programs are shown in Figure 4. For comparison, we include benchmark results from a supercompiler of Mitchell [7].

The "append", "factorial", "raytracer", "sumtree" and "treeflip" benchmarks are all standard examples that have been described in previous work on supercompilation and deforestation [3, 4, 7, 16]. The "sumsquare" program is taken from work in stream fusion [17]. The "bernouilli", "digits-of-e2", "exp3_8", "primes", "rfib", "tak", "wheel-sieve1", "wheel-sieve2" and "x2n1" benchmarks are from the imaginary portion of the nofib benchmark suite [18].

We tested two variants of our supercompiler: one where the supercompiler evaluated primitive operations (primops), and one where it did not. Both variants treated primitives as strict operations.

The benchmark results are promising. The supercompiler without primops reduced runtime by an (arithmetic) average of 26% compared to GHC alone. Evaluating primops reduced the average runtime reduction to 16%. Similar to our system, Mitchell's system achieved an average reduction of 27%, though the improvements had a rather different profile.

The use of supercompilation in practice is limited because, despite the fact that it is guaranteed to terminate, it might take very long indeed to do so. Nofib imaginary suite benchmarks such as "digits-of-e1" and "gen_regexps" are prohibitively expensive to supercompile in both our system and that of Mitchell. Interestingly, the same problem afflicts "tak" – but only when evaluation of primops is enabled.
Mitchell [7]:
Program        SC(a)   Cmp(b)  Run(c)  Mem(d)    Size(e)
append         0.0s    0.88    0.86    0.85      1.29
bernouilli     5.8s    1.63    0.98    0.97      3.76
digits-of-e2   4.2s    1.24    0.32    0.46      1.15
exp3_8         0.8s    1.34    0.96    1.00      6.59
factorial      0.0s    0.99    0.95    1.00      0.77
primes         0.1s    1.04    0.63    0.99      0.79
raytracer      0.0s    1.00    0.57    0.44      1.54
rfib           0.0s    0.94    0.93    1.00      0.87
sumsquare      19.5s   1.45    0.36    0.00      7.38
sumtree        0.1s    1.01    0.13    0.00      1.50
tak            0.1s    0.86    0.81    655.04    0.59
treeflip       0.1s    1.03    0.56    0.45      1.99
wheel-sieve1   N/A     N/A     N/A     N/A       N/A
wheel-sieve2   N/A     N/A     N/A     N/A       N/A
x2n1           0.1s    1.06    0.92    0.99      1.39
Average        2.4s    1.10    0.73    44.35     2.11
Minimum        0.0s    0.86    0.13    0.00      0.59
Maximum        19.5s   1.63    1.00    655.04    7.38

Evaluator-based, no primops:
Program        SC(a)   Cmp(b)  Run(c)  Mem(d)      Size(e)
append         0.0s    1.00    0.89    0.87        3.24
bernouilli     0.1s    1.07    0.98    0.95        2.26
digits-of-e2   0.1s    1.07    1.17    1.08        2.81
exp3_8         8.7s    2.85    0.59    0.67        85.17
factorial      0.0s    0.96    0.99    1.00        1.00
primes         0.0s    0.98    0.72    1.07        0.87
raytracer      0.0s    1.00    0.52    0.45        1.37
rfib           0.0s    1.00    0.67    1.00        2.00
sumsquare      2.3s    1.97    0.05    0.00        20.78
sumtree        0.0s    1.02    0.14    0.00        2.46
tak            0.1s    1.34    0.74    18644.34    7.22
treeflip       0.0s    1.02    0.13    0.05        2.53
wheel-sieve1   22.2s   7.87    0.90    0.53        71.07
wheel-sieve2   1.3s    3.16    1.55    1.21        18.35
x2n1           0.0s    1.10    0.99    0.95        1.21
Average        2.3s    1.83    0.74    1243.61     14.82
Minimum        0.0s    0.96    0.05    0.00        0.87
Maximum        22.2s   7.87    1.55    18644.34    85.17

Evaluator-based, primops:
Program        SC(a)   Cmp(b)  Run(c)  Mem(d)   Size(e)
append         0.0s    1.03    0.92    0.87     3.24
bernouilli     0.1s    1.07    0.98    0.95     2.24
digits-of-e2   0.1s    1.08    1.18    1.09     2.79
exp3_8         15.4s   3.35    0.55    0.67     114.31
factorial      0.0s    0.98    1.05    1.00     0.91
primes         0.0s    0.98    0.71    1.07     0.80
raytracer      0.0s    1.00    0.51    0.45     1.38
rfib           0.0s    1.00    0.67    1.01     2.00
sumsquare      3.0s    1.95    0.06    0.00     21.15
sumtree        0.2s    1.24    0.68    0.93     9.09
tak            N/A     N/A     N/A     N/A      N/A
treeflip       0.2s    1.47    0.81    0.91     19.40
wheel-sieve1   16.8s   10.61   1.00    0.54     71.47
wheel-sieve2   1.4s    3.06    1.55    1.21     18.24
x2n1           0.0s    1.15    0.99    0.95     1.18
Average        2.7s    2.06    0.84    0.84     17.95
Minimum        0.0s    0.98    0.06    0.00     0.80
Maximum        16.8s   10.61   1.55    1.21     114.31

(a) Supercompilation time (seconds)
(b) GHC compile time relative to no supercompilation
(c) Program runtime relative to no supercompilation
(d) Runtime allocation relative to no supercompilation
(e) Size (in syntax tree nodes) of program relative to no supercompilation

Figure 4: Benchmark results
Primitive operations. Indeed, the supercompiler performed worse overall when evaluating primops than when it left them unevaluated – particularly suffering on "sumtree" and "treeflip". These benchmarks have a common structure where a binary tree is generated and then consumed by a function pipeline, terminated by a simple sum of the tree nodes. The initial construction of the tree does not deforest cleanly, but the consuming function pipeline makes several intermediate copies of the tree which can be deforested to produce a function that produces the required sum directly. Both our system (without primops) and Mitchell's system are able to fuse these pipelines together.

The addition of primops to the system means that we create specialisations of the fused pipeline that include in their evaluation contexts frames such as 2 + •, where 2 is a partial sum of the tree. Every specialisation of the fused pipeline includes such a stack frame, and because the partial sum changes regularly those specialisations can never be reused. We end up building a lot of specialisations of the pipeline for a few values of the partial sum, before the termination condition kicks in and stops us. Unfortunately, the resulting termination splitting prevents us from fusing the pipeline entirely. The net result is that the first few iterations of the sum are computed with perfect deforestation, but later iterations must fall back on a fully-forested function isomorphic to the original unfused pipeline.

Recursive let. We are able to report results for two benchmarks ("wheel-sieve1" and "wheel-sieve2") that Mitchell's system is unable to supercompile because they make fundamental use of recursive let. We achieve an improvement in "wheel-sieve1" by deforesting intermediate lists, but actually manage to increase allocations in "wheel-sieve2".

Opportunities for improvement. The "tak" benchmark reported a staggering 18,000-fold increase in allocations, although this was up from a very low base – the unmodified program allocates only 13kB. Mitchell's supercompiler exhibits the same problem, albeit to a lesser degree. Investigation shows that the allocation increase is due to supercompilation introducing several large join points which take boxed integers as arguments. When compiled without supercompilation, there are no join points and all arithmetic is unboxed by GHC's strictness analyser [19].

The benchmark where we do noticeably worse than Mitchell is "digits-of-e2" – we actually increase both allocations and runtime, while he reduces each figure by more than 50%. Although the exact reasons remain unclear, it appears that once again the problem is that the supercompilation process has prevented GHC from aggressively unboxing the output.

Supercompilation time. Benchmarking our supercompiler on one program ("digits-of-e2") showed that the largest single share of its time (42%) is spent on managing names and renaming. Matching against previous states accounted for 14% of the runtime. Only 6% of time was spent testing the termination condition.

6. Related Work

Supercompilation was introduced by Turchin [1], but has recently seen a revival of interest from the call-by-value [3], call-by-name [8] and call-by-need [7] perspectives. Partial evaluation [20] is a technique closely related to supercompilation. The fields overlap somewhat, but supercompilers tend to make a distinctive set of choices which set them apart: they specialise expressions in the context in which they occur, operate on unannotated programs and test for termination online. Theoretical work has suggested that certain kinds of partial evaluator suffer from strictly less information propagation than supercompilers, limiting their optimising power [6].

The idea of building a partial evaluation system around an actual evaluator is hardly new – it is present from the very earliest work by Sestoft et al. [21]. However, this approach seems to have received surprisingly little attention in the supercompilation community, though it is somewhat foreshadowed by early work of Turchin [22].

Much of the supercompilation literature makes use of the homeomorphic embedding test for ensuring termination [3, 8, 23]. Users of this test uniformly report that testing the termination condition makes up the majority of their supercompilers' runtime [3, 23]. The tag-bag criterion appears to be much more efficient – our supercompiler spends only 6% of its runtime on termination testing.

Jørgensen has previously produced a compiler for call-by-need through partial evaluation of a Scheme partial evaluator with respect to an interpreter for the lazy language [24]. His work made use of a partial evaluator capable of dealing with the set! primitive, which was used to implement updateable thunks. Our supercompiler takes a direct approach that avoids the need for any imperative features in the language being supercompiled.
7. Further Work

Because the supercompiler described here is nicely separated from issues of evaluation order, it should be straightforward to modify the system to supercompile a pure call-by-name language for e.g. the purposes of theorem proving. Indeed, a splitter for call-by-name (or call-by-value) is rather simple to define because such evaluation strategies have no equivalent to update frames, and it is always permissible to duplicate heap bindings – so no work-duplication check is required at all.

We plan to extend the supercompiler to work on the typed language System FC [25] for implementation as a part of GHC. Again, this should be fairly straightforward, and involve mostly local changes to the evaluator. Supercompilation works best when it has access to the whole program, but GHC already has the necessary facilities to get hold of the definitions from imported modules, in the shape of interface files.

Although our presentation is nicely modular, the split function remains a tricky point and heavily dependent on the semantics of the language under consideration. A principled way to derive split from the operational semantics would be an interesting avenue for further exploration.

8. Conclusions

Supercompilation is a simple, powerful and principled technique for program optimisation. A single pass with a supercompiler achieves many optimisations that have traditionally been laboriously specified and implemented independently. We have shown how to produce a supercompiler by basing it explicitly on an evaluator. This clean design allowed us to extend the technique to lazy languages with recursive let, by building the supercompiler around a call-by-need evaluator. Initial benchmark results are promising, but also bring to light weaknesses in the algorithm. In particular, a method is sorely needed for reducing the worst-case runtime of supercompilation.

Acknowledgments

This work was partly supported by a PhD studentship generously provided by Microsoft Research. Thanks are due to Neil Mitchell and Peter Jonsson for enlightening discussions and feedback, and to the anonymous reviewers for their detailed feedback.

References

[1] Valentin F. Turchin. The concept of a supercompiler. ACM Trans. Program. Lang. Syst., 8(3):292–325, 1986.
[2] Alexei P. Lisitsa and Andrei P. Nemytykh. Verification as specialization of interpreters with respect to data. In Proceedings of the First International Workshop on Metacomputation in Russia, pages 94–112, 2008.
[3] Peter A. Jonsson and Johan Nordlander. Positive supercompilation for a higher order call-by-value language. In POPL '09: Proceedings of the 36th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2009.
[4] Philip Wadler. Deforestation: Transforming programs to eliminate trees. In ESOP '88, volume 300 of Lecture Notes in Computer Science, pages 344–358. Springer Berlin / Heidelberg, 1988.
[5] Simon Peyton Jones. Constructor specialisation for Haskell programs. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, ICFP 2007, pages 327–337, 2007.
[6] Morten Heine Sørensen, Robert Glück, and Neil D. Jones. Towards unifying partial evaluation, deforestation, supercompilation, and GPC. In ESOP '94: Proceedings of the 5th European Symposium on Programming, pages 485–500, London, UK, 1994. Springer-Verlag.
[7] Neil Mitchell. Rethinking supercompilation. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, ICFP 2010. ACM, 2010.
[8] Ilya Klyuchnikov. Supercompiler HOSC 1.0: under the hood. Preprint 63, Keldysh Institute of Applied Mathematics, Moscow, 2009.
[9] Cormac Flanagan, Amr Sabry, Bruce F. Duba, and Matthias Felleisen. The essence of compiling with continuations. ACM SIGPLAN Notices, 28(6):237–247, 1993.
[10] Peter Sestoft. Deriving a lazy abstract machine. Journal of Functional Programming, 7(3):231–264, 1997.
[11] G. Higman. Ordering by divisibility in abstract algebras. Proceedings of the London Mathematical Society, 3(1):326, 1952.
[12] Michael Leuschel. On the power of homeomorphic embedding for online termination. In Static Analysis, volume 1503 of Lecture Notes in Computer Science, pages 230–245. Springer Berlin / Heidelberg, 1998.
[13] Neil Mitchell and Colin Runciman. Uniform boilerplate and list processing. In Proceedings of the ACM SIGPLAN Workshop on Haskell, page 60. ACM, 2007.
[14] Morten Heine Sørensen and Robert Glück. Introduction to supercompilation. In Partial Evaluation, volume 1706 of Lecture Notes in Computer Science, pages 246–270. Springer Berlin / Heidelberg, 1999.
[15] Simon Peyton Jones, Cordy Hall, Kevin Hammond, Will Partain, and Phil Wadler. The Glasgow Haskell compiler: a technical overview, 1992.
[16] Jan Kort. Deforestation of a raytracer. Master's thesis, Department of Computer Science, University of Amsterdam, The Netherlands, 1996.
[17] Duncan Coutts, Roman Leshchinskiy, and Donald Stewart. Stream fusion: From lists to streams to nothing at all. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, ICFP 2007, April 2007.
[18] Will Partain. The nofib benchmark suite of Haskell programs. In Proceedings of the 1992 Glasgow Workshop on Functional Programming, pages 195–202, London, UK, 1993. Springer-Verlag.
[19] Simon Peyton Jones and John Launchbury. Unboxed values as first class citizens in a non-strict functional language. In Functional Programming Languages and Computer Architecture, pages 636–666. Springer, 1991.
[20] Neil D. Jones, Carsten K. Gomard, and Peter Sestoft. Partial Evaluation and Automatic Program Generation. Prentice-Hall International Series in Computer Science, 1993.
[21] Peter Sestoft. The structure of a self-applicable partial evaluator. In Programs as Data Objects, volume 217 of Lecture Notes in Computer Science, pages 236–256. Springer Berlin / Heidelberg, 1986.
[22] Valentin F. Turchin. The algorithm of generalization in the supercompiler. In Dines Bjørner, Andrei P. Ershov, and Neil D. Jones, editors, Partial Evaluation and Mixed Computation, pages 531–549.
[23] Neil Mitchell and Colin Runciman. A supercompiler for core Haskell. In Implementation and Application of Functional Languages, volume 5083 of Lecture Notes in Computer Science, pages 147–164. Springer Berlin / Heidelberg, 2008.
[24] Jesper Jørgensen. Generating a compiler for a lazy language by partial evaluation. In POPL '92: Proceedings of the 19th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 258–268, New York, NY, USA, 1992. ACM.
[25] Martin Sulzmann, Manuel Chakravarty, Simon Peyton Jones, and Kevin Donnelly. System F with type equality coercions. In ACM SIGPLAN International Workshop on Types in Language Design and Implementation (TLDI '07). ACM, 2007.
Species and Functors and Types, Oh My!

Brent A. Yorgey
University of Pennsylvania
[email protected]

Abstract

The theory of combinatorial species, although invented as a purely mathematical formalism to unify much of combinatorics, can also serve as a powerful and expressive language for talking about data types. With potential applications to automatic test generation, generic programming, and language design, the theory deserves to be much better known in the functional programming community. This paper aims to teach the basic theory of combinatorial species using motivation and examples from the world of functional programming. It also introduces the species library, available on Hackage, which is used to illustrate the concepts introduced and can serve as a platform for continued study and research.

Categories and Subject Descriptors D.3.3 [Language Constructs and Features]: Data types and structures; D.1.1 [Programming Techniques]: Applicative (Functional) Programming; G.2.1 [Combinatorics]

General Terms Languages, Theory

Keywords Combinatorial species, algebraic data types

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Haskell'10, September 30, 2010, Baltimore, Maryland, USA. Copyright © 2010 ACM 978-1-4503-0252-4/10/09...$10.00

1. Introduction

The theory of combinatorial species was invented by André Joyal in 1981 [16] as an elegant framework for understanding and unifying much of enumerative combinatorics. Since then, mathematicians have continued to develop the theory, proving a wide range of fundamental results and producing at least one excellent reference text on the topic [4]. Connections to computer science and functional programming have been pointed out in detail, notably by Flajolet, Salvy, and Zimmermann [12, 13]. Sadly, however, this beautiful theory is not widely known among functional programmers.

Suppose Dorothy G. Programmer has created the following data type to aid in her ethological study of alate simian family groups:

data Family a = Monkey Bool a
              | Group a [Family a ]

That is, a family (parameterized by names of type a) is either a single monkey with a boolean indicating whether it can fly, or an alpha male together with a group of families. While developing and testing her software, Dorothy might want to do things such as enumerate or count all the family structures of a certain size, or randomly generate family structures. There exist tools for accomplishing at least two of these tasks: QuickCheck [9] and SmallCheck [22] can be used to do random and exhaustive generation, respectively.
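For instance, a hand-written QuickCheck generator for Family might look like the sketch below; the size-halving scheme is one plausible choice of ours, not something prescribed here:

  import Test.QuickCheck

  instance Arbitrary a => Arbitrary (Family a) where
    arbitrary = sized genFamily
      where
        genFamily 0 = Monkey <$> arbitrary <*> arbitrary
        genFamily n = oneof
          [ Monkey <$> arbitrary <*> arbitrary
          , Group  <$> arbitrary
                   <*> listOf (genFamily (n `div` 2))  -- shrink to keep trees finite
          ]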
The theory of combinatorial species, although invented as a purely mathematical formalism to unify much of combinatorics, can also serve as a powerful and expressive language for talking about data types. With potential applications to automatic test generation, generic programming, and language design, the theory deserves to be much better known in the functional programming community. This paper aims to teach the basic theory of combinatorial species using motivation and examples from the world of functional programming. It also introduces the species library, available on Hackage, which is used to illustrate the concepts introduced and can serve as a platform for continued study and research.
newtype Bag a = Bag [a ] and endow it with custom QuickCheck and SmallCheck generators, but this is rather ad-hoc. What if she later decides that the order of the families does matter after all, but only up to cyclic rotations? Or that groups must always contain at least two families? Or what if she wants to have a data structure representing the graph of interactions between different family groups? What Dorothy needs is a coherent framework in which to describe these sorts of sophisticated structures. The theory of species is precisely such a framework: for example, her original data structure can be described succinctly by the regular species expression F = 2 • X + X • (L ◦ F); Section 3 explains how to interpret this expression. The variants on her structure correspond to non-regular species (Section 5) and can be expressed with only simple tweaks to the original expression. The payoff is that these species expressions form an abstract syntax (Section 6) with multiple useful semantic interpretations, including ways to exhaustively enumerate, count, or randomly generate structures (Sections 7 and 8). This paper is available at http://www.cis.upenn.edu/ ~byorgey/papers/species-pearl.lhs as a literate Haskell document. The species library itself, together with a good deal of documentation and examples, is available on Hackage [10] at http://hackage.haskell.org/package/species.
Categories and Subject Descriptors D.3.3 [Language Constructs and Features]: Data types and structures; D.1.1 [Programming Techniques]: Applicative (Functional) Programming; G.2.1 [Combinatorics] General Terms Keywords
Languages, Theory
Combinatorial species, algebraic data types
1. Introduction The theory of combinatorial species was invented by Andr´e Joyal in 1981 [16] as an elegant framework for understanding and unifying much of enumerative combinatorics. Since then, mathematicians have continued to develop the theory, proving a wide range of fundamental results and producing at least one excellent reference text on the topic [4]. Connections to computer science and functional programming have been pointed out in detail, notably by Flajolet, Salvy, and Zimmermann [12, 13]. Sadly, however, this beautiful theory is not widely known among functional programmers. Suppose Dorothy G. Programmer has created the following data type to aid in her ethological study of alate simian family groups: data Family a = Monkey Bool a | Group a [Family a ]
2. Combinatorial species Intuitively, a species describes a family of structures, parameterized by a set of labels which identify locations in the structures. In programming language terms, a species is like a polymorphic type constructor with a single type argument.
That is, a family (parameterized by names of type a) is either a single monkey with a boolean indicating whether it can fly, or an alpha male together with a group of families. While developing and testing her software, Dorothy might want to do things such as enumerate or count all the family structures of
Definition 1. A species F consists of a pair of mappings, F• and F↔ , with the following properties: • F• , given a finite set U of labels, sends U to a finite set of
structures F• [U ] which can be “built from” the given labels. We call F• [U ] the set of F-structures with support U , or sometimes just F-structures over U . ∼ • F↔ , given a bijection σ : U1 ↔ U2 between two label sets U1 and U2 (i.e. a relabeling), “lifts” σ to a bijection between F-structures, ∼ F↔ [σ] : F• [U1 ] ↔ F• [U2 ].
Moreover, this lifting must be functorial: the identity relabeling should lift to the identity on structures, and composition of relabelings should commute with lifting.

We usually omit the subscript on F• when it is clear from context. For example, Figure 1 illustrates lifting a relabeling σ to a relabeling of binary trees.

Figure 1. Relabeling a binary tree

Note that the notion of structures in this definition is entirely abstract: a "structure" is just an element of the set output by F•, which could be anything (subject to the restrictions on F). Fans of category theory will recognize a much more concise version of this definition: a species is an endofunctor on B, the category of finite sets with bijections as arrows. You won't need to know any category theory to understand this paper or to make use of the species library; however, the categorical point of view does add considerable conciseness and depth to the study of combinatorial species. The ability to relabel structures means that the actual labels we use don't matter; we get "the same structures", up to relabeling, for any label sets of the same size. We might say that species are parametric in the label sets of a given size. In particular, F's action on all label sets of size n is determined by its action on any particular such set: if |U1| = |U2| and we know F[U1], we can determine F[U2] by lifting any bijection between U1 and U2. So we often take the finite set of natural numbers {1, ..., n} (also written [n]) as the canonical label set of size n, and write F[n] for the set of F-structures built from this set. As a final note on the definition, it is tempting to assume that labels play the role of the "data" held by data structures. Instead, however, labels should be thought of as names for the locations within a structure. The idea is that data structures can be decomposed into a shape together with some sort of content [1, 15]. In this case, a labeled shape is some sort of structure built out of labels, and the content can be specified by a mapping from labels to data (which need not be injective).

3. Regular species

Although the formal definition of species is good to keep in mind as a source of intuition, in practice we take an algebraic approach, building up complex species from a small set of primitive species and species operations. We start our tour of the species menagerie with what I term regular species.¹ These should seem like old friends: intuitively, regular species correspond to Haskell's algebraic data types. We'll step back to define regular species more abstractly in Section 3.2, but first let's see how to build them.

¹ There is no widely accepted name for this class of species; I call them regular since they correspond to the regular tree types of Morris et al. [20].

3.1 Basic regular species

For each primitive species or species operation, we will define a corresponding Haskell data type embodying the F• mapping—that is, values of the type will correspond to F-structures. The F↔ mapping, in turn, can be expressed by an instance of the Functor type class, whose method fmap :: (a → b) → f a → f b shows how to relabel a structure f a by applying a relabeling map a → b to each of its labels. (We can also use fmap to fill in a labeled shape with content, by applying it to a mapping from labels to data.) For each species we also exhibit a method to enumerate all distinct labeled structures on a given set of labels, via an instance of the Enumerable type class shown in Figure 2. The actual Enumerable type class used in the species library is more sophisticated, but not fundamentally so.

class Enumerable f where
  enumerate :: [a] → [f a]

Figure 2. The Enumerable type class

Finally, for each species or species operation we also exhibit a picture as an aid to intuition. These pictures are not meant to be formal, but they will generally conform to the following rules:

• The left-hand side of each picture shows a canonical set of labels (depicted as circled numbers), either of an arbitrary size, or of a size that is "interesting" for the species being defined. Although the canonical label set [n] is used, of course the labels could be replaced by any others.

• In the middle is an arrow labeled with the name of the species being defined.

• On the right-hand side is a set of structures, or some sort of schematic depiction of the construction of a "typical" structure (the species then being understood to construct such structures "in all possible ways"). When the name of a species is superimposed on a set of labels, it represents a structure of the given species built from the labels.

Zero  The species 0 (Figure 3) corresponds to the constantly void type constructor. That is, it yields no structures no matter what labels it is given as input. We are forced to cheat a bit in the Functor instance for 0, since Haskell does not allow empty case expressions.

data 0 a

instance Functor 0 where
  fmap = ⊥

instance Enumerable 0 where
  enumerate _ = []

Figure 3. The primitive species 0
One  The species 1 (Figure 4) yields a single unit structure when applied to an empty set of labels, and no structures otherwise. In other words, there is exactly one structure of type 1 a, and it contains no locations where values of type a can be stored. It corresponds to nullary constructors in algebraic data types. The unit structure built by 1 is shown in Figure 4 as a filled square, to emphasize the fact that it contains no labels.

data 1 a = 1

instance Functor 1 where
  fmap _ 1 = 1

instance Enumerable 1 where
  enumerate [] = [1]
  enumerate _  = []

Figure 4. The primitive species 1

The species of singletons  The species of singletons, X (Figure 5), yields a single structure when applied to a singleton label set, and no structures otherwise. That is, X corresponds to the identity type constructor, which has exactly one way of building a structure with a single location.

data X a = X a

instance Functor X where
  fmap f (X a) = X (f a)

instance Enumerable X where
  enumerate [x] = [X x]
  enumerate _   = []

Figure 5. The species X of singletons
Species sum  We define species sum (Figure 6) to correspond to type sum, i.e. disjoint (tagged) union. Given species F and G and a set of labels U, the set of (F + G)-structures over U is the disjoint union of the sets of F- and G-structures over U:

(F + G)[U] = F[U] ⊎ G[U].

In other words, an (F + G)-structure is either an F-structure or a G-structure, along with a tag specifying which. For example, 1 + X corresponds to the familiar Maybe type constructor. We can generalize 0 and 1 by defining the species n, for each n ≥ 0, to be the species which generates n distinct structures for the empty label set, and no structures for any nonempty label set; n is isomorphic to 1 + ··· + 1 (with n summands). For example, 2 corresponds to the constantly Bool type constructor, data CBool a = CBool Bool.

infixl 6 +
data (f + g) a = Inl (f a) | Inr (g a)

instance (Functor f, Functor g) ⇒ Functor (f + g) where
  fmap h (Inl x) = Inl (fmap h x)
  fmap h (Inr x) = Inr (fmap h x)

instance (Enumerable f, Enumerable g) ⇒ Enumerable (f + g) where
  enumerate ls = map Inl (enumerate ls)
              ++ map Inr (enumerate ls)

(+) :: (f a → g a) → (h a → j a) → (f + h) a → (g + j) a
(fg + hj) (Inl fa) = Inl (fg fa)
(fg + hj) (Inr ha) = Inr (hj ha)

Figure 6. Species sum
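To see the correspondence with Maybe concretely, we can enumerate (1 + X)-structures. The following is a sketch of the expected behavior, assuming the instances above (and suitable extensions for the symbolic type names, as in the paper's literate source); the printed form depends on Show instances not shown here:

> enumerate []  :: [(1 + X) Char]    -- one "Nothing-like" structure
[Inl 1]
> enumerate "a" :: [(1 + X) Char]    -- one "Just-like" structure
[Inr (X 'a')]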
It is not hard to verify that, up to isomorphism, 0 is the identity for species addition, and that + is associative and commutative. Since these algebraic laws correspond directly to generic isomorphisms between structures, we can represent the laws as Haskell code. We define a type of embedding-projection pairs, shown in Figure 7. A value of type f ↔ g is an isomorphism between f and g, witnessed by a pair of functions, one in each direction. We also define the identity isomorphism as well as composition and inversion of isomorphisms.

infix 1 ↔
data f ↔ g = (↔) { to :: ∀a. f a → g a, from :: ∀a. g a → f a }

ident :: f ↔ f
ident = id ↔ id

(>>>) :: (f ↔ g) → (g ↔ h) → (f ↔ h)
(fg ↔ gf) >>> (gh ↔ hg) = (gh ◦ fg) ↔ (gf ◦ hg)

inv :: (f ↔ g) → (g ↔ f)
inv (fg ↔ gf) = gf ↔ fg

Figure 7. Isomorphisms

Armed with these definitions, Figure 8 presents the algebraic laws for sum in Haskell form, as implemented in the species library. The one technical issue to note is that for the congruences inSumL and inSumR (and the corresponding inProdL and inProdR shown in the next section), we must be careful to use lazy pattern matches, since the isomorphism between f and g may not be needed. Always forcing the proof of (f ↔ g) to weak-head normal form can cause some isomorphisms between recursive structures (such as the one shown in Figure 15) to diverge.

sumIdL :: 0 + f ↔ f
sumIdL = (λ(Inr x) → x) ↔ Inr

sumComm :: f + g ↔ g + f
sumComm = swapSum ↔ swapSum
  where swapSum (Inl x) = Inr x
        swapSum (Inr x) = Inl x

sumAssoc :: f + (g + h) ↔ (f + g) + h
sumAssoc = reAssocL ↔ reAssocR
  where reAssocL (Inl x)       = Inl (Inl x)
        reAssocL (Inr (Inl x)) = Inl (Inr x)
        reAssocL (Inr (Inr x)) = Inr x
        reAssocR (Inl (Inl x)) = Inl x
        reAssocR (Inl (Inr x)) = Inr (Inl x)
        reAssocR (Inr x)       = Inr (Inr x)

inSumL :: (f ↔ g) → (f + h ↔ g + h)
inSumL ~(fg ↔ gf) = (fg + id) ↔ (gf + id)

inSumR :: (f ↔ g) → (h + f ↔ h + g)
inSumR ~(fg ↔ gf) = (id + fg) ↔ (id + gf)

Figure 8. Algebraic laws for sum
Species product  Just as species sum corresponds to type sum, we define species product (Figure 9) to correspond to type product, in such a way that the resulting structures contain each label exactly once. So, to form an (F • G)-structure over U, we split U into two disjoint subsets, and form an ordered pair of an F-structure built from the first subset and a G-structure built from the second. Doing this in all possible ways yields the species (F • G). Formally,

(F • G)[U] = ⨄_{U = U1 ⊎ U2} F[U1] × G[U2],

where ⨄ denotes repeated disjoint union and × denotes Cartesian product of sets.

infixl 7 •
data (f • g) a = f a • g a

instance (Functor f, Functor g) ⇒ Functor (f • g) where
  fmap h (x • y) = fmap h x • fmap h y

instance (Enumerable f, Enumerable g) ⇒ Enumerable (f • g) where
  enumerate ls = [x • y | (fls, gls) ← splits ls
                        , x ← enumerate fls
                        , y ← enumerate gls]

splits :: [a] → [([a], [a])]
splits []       = [([], [])]
splits (x : xs) = (map ◦ first) (x:) ss ++ (map ◦ second) (x:) ss
  where ss = splits xs
        first  f (x, y) = (f x, y)
        second f (x, y) = (x, f y)

Figure 9. Species product
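As a quick sanity check (again a sketch of the expected session, assuming the instances above and suitable Show instances), enumerating (X • X)-structures over two labels should produce exactly the two orderings:

> enumerate "ab" :: [(X • X) Char]
[X 'a' • X 'b', X 'b' • X 'a']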
For example, X • X (which can be abbreviated X²) is the species of ordered pairs. X yields no structures unless it is given a single label, so the only way to get an X • X structure is if we start with two labels and partition them into two singleton sets to pass on to the X's. Of course, there are two ways to do this, reflecting the two possible orderings of the labels. Similarly, X³ is the species of ordered triples, with 3! = 6 orderings for the labels, and so on. Up to isomorphism, 1 is an identity for species product, and 0 is an annihilator. It is also not hard to check that • is associative and commutative, and distributes over + (as usual, all up to isomorphism). Thus, species form a commutative semiring. The isomorphisms justifying these algebraic laws are shown in Figure 10, although their straightforward implementations are omitted in the interest of space.

prodIdL     :: 1 • f ↔ f
prodIdR     :: f • 1 ↔ f
prodAbsorbL :: 0 • f ↔ 0
prodComm    :: f • g ↔ g • f
prodAssoc   :: f • (g • h) ↔ (f • g) • h
prodDistrib :: f • (g + h) ↔ (f • g) + (f • h)
inProdL     :: (f ↔ g) → (f • h ↔ g • h)
inProdR     :: (f ↔ g) → (h • f ↔ h • g)

Figure 10. Algebraic laws for product

Least fixed points and the implicit species theorem  If we add a least fixed point operator μ, we now get the regular types or algebraic data types familiar to any functional programmer [20]. For example, the species L of linear orderings (or lists for short) can be defined as L = μL. 1 + X • L. That is, a list is either empty (1) or an element paired with a list (X • L). For any set U of labels, L[U] is the set of all linear orderings of U (Figure 11); of course, |L[n]| = n!.

instance Enumerable [] where
  enumerate = Data.List.permutations

listRec :: [] ↔ 1 + (X • [])
listRec = unroll ↔ roll
  where unroll []       = Inl 1
        unroll (x : xs) = Inr (X x • xs)
        roll (Inl 1)          = []
        roll (Inr (X x • xs)) = x : xs

Figure 11. The species L of linear orderings

Actually, mathematicians would not write L = μL. 1 + X • L, but simply L = 1 + X • L. This is not because they are being sloppy, but because of the implicit species theorem [4], which is a combinatorial analogue of the implicit function theorem from analysis. Suppose we have a species equation which implicitly defines F in terms of itself. If F yields no structures on the empty label set, and is not trivially reducible to itself, then the implicit species theorem guarantees that there is a unique solution for F with F(0) = 0, which is exactly the least fixed point of the implicit equation. Of course, the criteria given above are somewhat vague; a more precise formulation is explained in Section 8.1. The species L, as defined above, does not actually meet these criteria, since it yields a structure on the empty label set. However, if we let L+ denote the species of nonempty lists, with L+ + 1 = L, then we have L+ = X • (L+ + 1), which does meet the criteria and hence has a unique solution. Thus, L = 1 + L+ is uniquely determined as well, and we are justified in forgetting about μ and simply manipulating the implicit equation L = 1 + X • L however we like. For example, expanding L in its own definition, we find that

L = 1 + X • L
  = 1 + X • (1 + X • L)
  = 1 + X + X² • L

Continuing this process, we find that L = 1 + X + X² + X³ + ..., which corresponds to the observation that a list is either empty, or a single element, or an ordered pair, or an ordered triple, and so on. We can also "solve" the implicit equation for L to obtain L = 1/(1 − X). This may seem nonsensical at this point—we have defined neither subtraction nor division of species—but it is perfectly valid in the context of virtual species (Section 8.1), and directly corresponds to the exponential generating function for L (Section 7). As another example of recursive species and the power of the implicit species theorem, consider the Haskell data types shown in Figure 12.

data BTree a = Empty
             | Node a (BTree a) (BTree a)

data Paren a = Leaf a
             | Pair (Paren a) (Paren a)

Figure 12. Binary trees and binary parenthesizations

BTree is the type of binary trees with data at internal nodes, and Paren is the type of binary parenthesizations, that is, full binary trees with data stored in the leaves (Figure 13).

Figure 13. Binary trees (top) and parenthesizations (bottom)

It is not hard to write down equations implicitly defining species corresponding to these types:

B = 1 + X • B²
P = X + P²

Figure 14 shows the isomorphisms witnessing these implicit equations.

bTreeRec :: BTree ↔ 1 + X • BTree • BTree
bTreeRec = unroll ↔ roll
  where unroll Empty        = Inl 1
        unroll (Node x l r) = Inr (X x • l • r)
        roll (Inl 1)             = Empty
        roll (Inr (X x • l • r)) = Node x l r

parenRec :: Paren ↔ X + Paren • Paren
parenRec = unroll ↔ roll
  where unroll (Leaf x)   = Inl (X x)
        unroll (Pair l r) = Inr (l • r)
        roll (Inl (X x))   = Leaf x
        roll (Inr (l • r)) = Pair l r

Figure 14. Implicit equations for B and P

Again, B itself does not fulfill the conditions of the implicit species theorem, but we can easily express it in terms of one that does. Suppose we happen to notice that there seem to always be the same number of B-structures over n labels as there are P-structures over n + 1 labels (say, by enumerating small instances). Is there some sort of isomorphism lurking here? In particular, can we pair an extra element with a BTree to get something isomorphic to a Paren? Well, pairing an extra element with a BTree corresponds to the species X • B. Let's just do some algebra and see what we get:

X • B = X • (1 + X • B²)
      = X + X² • B²
      = X + (X • B)²

Thus, X • B satisfies the same implicit equation as P—so by the implicit species theorem, they must be isomorphic! Not only that, we can directly read off the isomorphism as the composition of the isomorphisms corresponding to our algebraic manipulations (Figure 15). Of course, coding this by hand is a bit of a pain, but it is not hard to imagine deriving it automatically: the species library cannot yet do this, but it is planned for a future release.

bToP :: X • BTree ↔ Paren
bToP = inProdR bTreeRec
   >>> prodDistrib
   >>> inSumL prodIdR
   >>> inSumR (    inProdR (inProdL prodComm >>> inv prodAssoc)
               >>> prodAssoc
               >>> inProdL bToP
               >>> inProdR bToP)
   >>> inv parenRec

Figure 15. The isomorphism between X • B and P

Could we have come up with this isomorphism without the theory of species? Of course. This particular isomorphism is not even that complex. The point is that by boiling things down to their essentials, the theory allowed us to elegantly and easily derive the isomorphism with only a few lines of algebra.
3.2 Regular species, formally

We are now ready to state the precise definition of regular species. A first characterization is as follows:

Definition 2. A species F is regular if it can be expressed in terms of 1, X, +, •, and least fixed point.

This definition validates the promised intuition that regular species correspond to Haskell algebraic data types, since normal Haskell 98 data type declarations provide exactly these features (forgetting for the moment about infinite data structures). However, there is a more direct characterization that makes apparent why this particular collection of species is interesting. We must first define what we mean by the symmetries of a structure. Recall that Sn denotes the symmetric group of order n, which has permutations of size n (that is, bijections between {1, ..., n} and itself) as elements, and composition of permutations as the group operation.

Definition 3. A permutation σ ∈ Sn is a symmetry of an F-structure f ∈ F[U] if and only if σ fixes f, that is, F↔[σ](f) = f.

For example, Figure 16 depicts a tree with a set of labels at each node. This structure has many nontrivial symmetries, such as the permutation which swaps 4 and 6 but leaves all the other labels unchanged; since 4 and 6 are in the same set, swapping them has no effect.

Figure 16. A labeled structure with nontrivial symmetries

However, the binary trees shown in Figure 1 have only the trivial symmetry, since permuting their labels in any nontrivial way yields different trees.

Definition 4. A species F is regular if every F-structure has the identity permutation as its only symmetry; such structures are also called regular.

It turns out that these two definitions are equivalent (with the slight caveat that we must allow countably infinite sums and products in the first definition). That species built from sum, product, and fixed point have no symmetries is not hard to see; less obvious is the fact that up to isomorphism, every species with no symmetries can be expressed in this way (a proof sketch is given in Section 8.1). Of course, since we cannot write down infinite sums or products in Haskell, there are some regular species which cannot be expressed as simple algebraic data types. For example, the regular species of prime-length lists,

X² + X³ + X⁵ + X⁷ + X¹¹ + ...,

cannot be written as a simple algebraic data type.² But aside from infinite sums and products, as long as we stick to data structures with no symmetries, Haskell's data types are perfectly adequate to express any data structure we could possibly think up.

² Although I am sure it can be expressed using GADTs and type-level arithmetic...

3.3 Other operations on regular species

In addition to sum and product, the class of regular species is also closed under other fundamental operations.

Species composition  Given species F and G, the composition F ◦ G is a species which builds "F-structures made out of G-structures", with the underlying labels distributed to the G-structures so that each label occurs exactly once in the overall structure (Figure 17). However, in order to ensure we get only a finite number of (F ◦ G)-structures of each size, G must not yield any structures on the empty label set. This corresponds exactly to the criterion for composing formal power series, namely, that the inner series have no constant term. Specifically, to build an (F ◦ G)-structure over a label set U, we

• partition U into some number of nonempty disjoint parts, U = U1 ⊎ U2 ⊎ ··· ⊎ Uk;
• create a G-structure on each of the Ui;
• create an F-structure on these G-structures.

Doing this in all possible ways yields the set of (F ◦ G)-structures over U.

newtype (f ◦ g) a = C { unC :: f (g a) }

instance (Functor f, Functor g) ⇒ Functor (f ◦ g) where
  fmap h = C ◦ (fmap ◦ fmap) h ◦ unC

instance (Enumerable f, Enumerable g) ⇒ Enumerable (f ◦ g) where
  enumerate ls = [C y | p ← partitions ls
                      , gs ← mapM enumerate p
                      , y ← enumerate gs]

partitions :: [a] → [[[a]]]
partitions []       = [[]]
partitions (x : xs) = [(x : ys) : p | (ys, zs) ← splits xs
                                    , p ← partitions zs]

Figure 17. Species composition

For example, R = X • (L ◦ R) (where L denotes the species of linear orderings) defines the species of rose trees, as defined in Data.Tree, with each node having a data element and any number of children; a standalone sketch of this recipe appears below. We can also easily encode nested data types [6] (such types are sometimes called "non-regular", although that nomenclature is confusing in the current context, since they do in fact correspond to regular species). For example, B = X + B ◦ X² is the species of complete binary trees; a B-structure is either a single leaf, or a complete binary tree with pairs of elements at the leaves. It is not hard to verify that composition is associative (but not commutative), and that it has X as both a left and right identity. Composition also distributes over both sum and product from the right: (F + G) ◦ H = (F ◦ H) + (G ◦ H), and similarly for (F • G) ◦ H. As noted at the beginning of this section, regular species are closed under composition. Although we won't prove this formally, it makes intuitive sense: if an F-structure has no symmetries, and in each location we put a G-structure which also has no symmetries, the resulting composed structure cannot have any symmetries either.
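To make the composition recipe concrete, here is a minimal standalone sketch in plain ASCII Haskell (the helper names are ours, not the library's) that enumerates labeled rose trees directly, following R = X • (L ◦ R): pick a root label, then build an ordered sequence of nonempty blocks from the remaining labels, recursively putting a rose tree on each block.

data Rose a = Rose a [Rose a] deriving Show

-- all ways to pick one label as the root, keeping the rest
picks :: [a] -> [(a, [a])]
picks []     = []
picks (x:xs) = (x, xs) : [ (y, x:ys) | (y, ys) <- picks xs ]

-- all (nonempty subset, complement) splits of a label set
blocks :: [a] -> [([a], [a])]
blocks = filter (not . null . fst) . sub
  where sub []     = [([], [])]
        sub (x:xs) = concat [ [(x:ys, zs), (ys, x:zs)] | (ys, zs) <- sub xs ]

-- an (L ◦ R)-structure: an ordered sequence of rose trees over disjoint blocks
forests :: [a] -> [[Rose a]]
forests [] = [[]]
forests ls = [ t : ts | (b, r) <- blocks ls, t <- roses b, ts <- forests r ]

-- an R = X • (L ◦ R) structure: a root label plus a forest over the rest
roses :: [a] -> [Rose a]
roses ls = [ Rose x ts | (x, rest) <- picks ls, ts <- forests rest ]

For example, roses [1,2] yields the two labeled trees Rose 1 [Rose 2 []] and Rose 2 [Rose 1 []], and roses [1,2,3] yields twelve trees, matching 3! · Catalan(2) = 12.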
Cardinality restriction  For a species F and a natural number n, Fn denotes the species F restricted to label sets of cardinality n. That is,

Fn[U] = F[U]  if |U| = n
Fn[U] = ∅     otherwise.

More generally, if P is any predicate on natural numbers, FP denotes the restriction of F to label sets whose size satisfies P. For example, Leven is the species of lists of even length. We have Leven = 1 + X² • Leven = 1 + X • Lodd and L = Leven + Lodd. More generally, for any species we have F = Feven + Fodd = F0 + F1 + F2 + ... As a final note, we often write F+ as an abbreviation for F>0, the species of nonempty F-structures. Unfortunately, it is difficult to represent general cardinality restriction with a Haskell type, since we would have to somehow embed a predicate on integers into the type level.

Differentiation  Of course, no discussion of an algebra for data types would be complete without mentioning differentiation. There has been a great deal of fascinating work in the functional programming community on differentiating data structures [2, 14, 17, 18]. As usual, however, the mathematicians beat us to it! Intuitively, the derivative of a data type D is the type of D-structures with a single "hole", that is, a distinguished location not containing any data. This is useful, for example, in building zipper structures [14] which keep track of a movable focus within a data structure. We can make this precise in the context of species by using a "dummy label" to correspond to the hole. Formally, given a species F, its derivative F′ sends label sets U to the set of F-structures on the label set U ∪ {∗}, where ∗ is a new element distinct from all elements of U. That is,

F′[U] = F[U ∪ {∗}].

newtype Diff f a = Diff (f (Maybe a))
  deriving Functor

instance Enumerable f ⇒ Enumerable (Diff f) where
  enumerate ls = map Diff (enumerate (Nothing : map Just ls))

Figure 18. Species differentiation

For example, L′-structures are lists with a distinguished hole. Since the structure on either side of the distinguished location is also a list, we have L′ = L². (The reader may enjoy proving this formally using the implicit species theorem and the algebraic identities for differentiation listed below.) Figure 18 shows a representation of this idea in Haskell. It is important to note that Diff f a will often admit values which are not among those listed by enumerate. For example, although as a Haskell type Diff 1 Int is perfectly well inhabited by the value Diff 1, enumerate will always produce the empty list at type Diff 1 Int. Isomorphisms between structure types should map between values listed by enumerate and not necessarily between all Haskell values of the given types; thus, for example, we are justified in treating Diff 1 as isomorphic to 0. Although regular species are closed under differentiation, it should be noted that there are regular species which are the derivative of a non-regular species (X, for example, is the derivative of the non-regular species E2, to be defined in Section 5). Why refer to this operation as differentiation? Simply because, somewhat astonishingly, it satisfies all the same algebraic laws as the differentiation operation in calculus!

• 1′ = 0
• X′ = 1
• (F + G)′ = F′ + G′
• (F • G)′ = F′ • G + F • G′
• (F ◦ G)′ = (F′ ◦ G) • G′

Explaining the intuition behind these isomorphisms and expressing them in Haskell is left as a challenge for the reader.
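As a small illustration of L′ = L² (a sketch; the helper below and its name are ours, not part of the paper's code), a one-hole list as produced by enumerate contains exactly one Nothing, so splitting at the hole yields the pair of lists predicted by L²:

import Data.Maybe (catMaybes, isNothing)

-- split a one-hole list at its hole; assumes exactly one Nothing,
-- which holds for values produced by enumerate
splitAtHole :: Diff [] a -> ([a], [a])
splitAtHole (Diff xs) = (catMaybes before, catMaybes (drop 1 after))
  where (before, after) = break isNothing xs

Both sides have (n + 1)! structures on n labels—for instance, 6 one-hole lists and 6 pairs of lists over two labels.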
4. Unlabeled structures

Before moving on to non-regular species, it's worth pausing to make precise what is meant by the "shape" of a structure.

Definition 5. For a species F, an unlabeled F-structure, or F-shape, is an equivalence class of labeled F-structures, where two structures s and t are considered equivalent if there is some relabeling σ such that F↔[σ](s) = t.

In other words, two labeled structures are equivalent if they are relabelings of each other. For example, Figure 19 shows three rose tree structures. The first two are equivalent, but the third is not equivalent to the first two, since there is obviously no way to turn it into the first two merely by changing the labels.

Figure 19. Equivalent and inequivalent trees

Although unlabeled structures formally consist of equivalence classes of labeled structures, we can informally think of them as normal structures built from "indistinguishable labels"; for a given species F, there will be one unlabeled F-structure for each possible "shape" that F-structures can take. For example, Figure 20 shows all the rose tree shapes on four nodes.

Figure 20. Unlabeled rose trees of size 4

For regular species, the distinction between labeled and unlabeled structures is uninformative. Since every possible permutation of the labels of a regular structure results in a distinct labeled structure, there are always exactly n! times as many labeled as unlabeled structures of size n. Thus, a method to enumerate all unlabeled structures of a regular species is easy. In fact, it is a slight simplification of the code we have already exhibited for enumerating labeled structures: instead of taking a list of labels as input, we take simply a natural number size, and output structures full of unit values. Enumerating unlabeled structures for non-regular species, however, is much more complicated; a partial implementation of unlabeled enumeration can be found in the species package. For example, we can enumerate all unlabeled sets of sets of size 4, corresponding to integer partitions of 4:

> enumerateU (set ‘o‘ nonEmpty set) 4 :: [Comp Set Set ()]
[{{(),(),(),()}},{{(),()},{(),()}},
 {{()},{(),(),()}},{{()},{()},{(),()}},
 {{()},{()},{()},{()}}]

5. Beyond regular species

As promised, the theory of combinatorial species can describe structures with nontrivial symmetries just as easily as regular structures. This section introduces common non-regular species and combinators.

The species of sets  The primitive species of sets, usually denoted E (from the French ensemble), represents unordered collections of elements. For any given set of labels, there is exactly one set structure, the set of labels itself. It is easy to see that E is not regular, since E-structures have every possible symmetry; permuting the elements of a set leaves the set unchanged. Although the standard mathematical name for this species is the species of sets, a better name for it from a computer science perspective is the species of bags. The term set usually indicates both that the order of the elements doesn't matter and that there are no duplicate elements; the species of bags only embodies the former, since we can have non-injective mappings from labels to data. However, to model sets we can certainly restrict ourselves to injective mappings.

newtype Bag a = Bag [a]
  deriving Functor

instance Eq a ⇒ Eq (Bag a) where
  Bag xs ≡ Bag ys = xs ‘subBag‘ ys ∧ ys ‘subBag‘ xs
    where subBag b = null ◦ foldl (flip delete) b

instance Enumerable Bag where
  enumerate ls = [Bag ls]

Figure 21. The species E of sets

In Figure 21, we declare a Haskell data type for E by declaring Bag to be isomorphic to [] via a newtype declaration, and deriving a Functor instance for Bag from the existing instance for lists, using GHC's newtype deriving extension. Since we don't care about the order of the elements in a Bag, we create an Eq instance to identify any two Bags with the same elements. We can also use E to build other interesting and useful species. For example:

• X • E is the species of pointed sets, also known as the species of elements and sometimes written ε. Pointed set structures consist of a single distinguished label paired with a set structure on the rest of the labels. Thus, there are precisely n labeled ε-structures on a set of n labels, one for each choice of the distinguished label. We can also think of an (X • E) structure as consisting solely of the distinguished label, since the set of remaining labels adds no information. In other words, we can treat E as a sort of "sink" for the elements we don't care about, and the same technique can be used generally for describing structures containing only a subset of the labels.

• E • E is the species of subsets, sometimes abbreviated ℘ in reference to the power set operator. Again, ℘-structures technically consist of a subset of the labels paired with its complement, but we may (and often do) ignore the complement (see the short example after this list).

• (E ◦ E+) is the species of set partitions: its structures are collections of nonempty sets.
154
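For instance, enumerating ℘ = E • E structures over two labels should produce the 2² = 4 subsets, each paired with its complement. This is a sketch of the expected session, assuming the instances above (and a Show instance for Bag, which is not defined in the paper):

> enumerate "ab" :: [(Bag • Bag) Char]
[Bag "ab" • Bag "", Bag "a" • Bag "b",
 Bag "b" • Bag "a", Bag "" • Bag "ab"]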
The species of cycles  The primitive species C of directed cycles (Figure 22) yields all directed cyclic orderings (known in the mathematical literature as necklaces) of the given labels. By convention, cycles are always nonempty, and are read clockwise when represented pictorially.

newtype Cycle a = Cycle [a]
  deriving Functor

instance Eq a ⇒ Eq (Cycle a) where
  Cycle xs ≡ Cycle ys = any (≡ ys) (rotations xs)
    where rotations xs = zipWith (++) (tails xs) (inits xs)

instance Enumerable Cycle where
  enumerate []       = []
  enumerate (x : xs) = (map (Cycle ◦ (x:)) ◦ permutations) xs

Figure 22. The species C of cycles

C is also non-regular, since each labeled C-structure is fixed by certain nontrivial permutations, namely, the ones which only "rotate" the labels. An example of an interesting species we can build using C is E ◦ C, the species of permutations, corresponding to the observation that every permutation can be decomposed into a collection of disjoint cycles. We note also that a cycle with a hole in it is isomorphic to a list, that is, C′ = L (Figure 23).

Figure 23. Differentiating a cycle

Cartesian product  Given two species F and G, we may define the Cartesian product F × G by

(F × G)[U] = F[U] × G[U],

where the × on the right denotes the standard Cartesian product of sets. That is, an (F × G)-structure is a pair of an F-structure and a G-structure, both of which are built over all the labels, instead of partitioning the labels as with normal product (Figure 24). However, instead of thinking of the labels as being duplicated, we think of an (F × G)-structure as two structures which are superimposed on the same label set. In particular, when specifying the content for an (F × G)-structure, we should still only map each label to a single piece of data.

data (f × g) a = f a × g a

instance (Functor f, Functor g) ⇒ Functor (f × g) where
  fmap f (x × y) = fmap f x × fmap f y

instance (Enumerable f, Enumerable g) ⇒ Enumerable (f × g) where
  enumerate ls = [x × y | x ← enumerate ls, y ← enumerate ls]

Figure 24. Cartesian product of species
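For example, superimposing two list structures on the same two labels should give 2! × 2! = 4 structures. A sketch of the expected session, assuming the instances above together with the Enumerable [] instance from Figure 11:

> enumerate "ab" :: [([] × []) Char]
["ab" × "ab", "ab" × "ba", "ba" × "ab", "ba" × "ba"]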
One interesting use of Cartesian product is to model some type class-polymorphic data structures, where the type class methods provide us with a second observable structure on the data elements. For example, a type constructor F with an Eq constraint on its argument can be modeled by the species

F × (E ◦ E+).

Structures of this species consist of an F-structure with a superimposed partition on the labels, with each part corresponding to an equivalence class. For example, Figure 25 shows a binary tree shape with a superimposed partition indicating which sets of elements are equal.

Figure 25. A (B × (E ◦ E+))-shape

Likewise, we can model an Ord constraint by superimposing an (L ◦ E+)-structure, which additionally places an observable ordering on the equivalence classes. This works particularly well in conjunction with the approach of Bernardy et al. [5] for testing polymorphic functions. Because of parametricity, it suffices to test polymorphic functions on randomly generated shapes filled with carefully chosen data; if the function works correctly for the chosen data then by parametricity it will work correctly for any data. The above discussion shows that we can treat Eq and Ord constraints as part of the shape of an input structure, and choose data to match. Cartesian product has E as both a left and right identity, and is associative, commutative, and distributes over species sum. Again, implementing these laws as isomorphisms is left as an exercise for the reader.

Functor composition  The final species operation we will explore is functor composition. Given species F and G, we define their functor composite by

(F □ G)[U] = F[G[U]],

that is, F-structures over the set of all G-structures on U (Figure 26). Like (F ◦ G)-structures, an (F □ G)-structure is an F-structure of G-structures, but instead of partitioning the labels U among the G-structures, we give all the labels to every G-structure. As with Cartesian product structures, (F □ G)-structures appear to contain each label multiple times, but in fact we should still think of them as containing each label once, with a rich structure superimposed on it.

data (f □ g) a = FC { unFC :: f (g a) }

instance (Functor f, Functor g) ⇒ Functor (f □ g) where
  fmap h = FC ◦ (fmap ◦ fmap) h ◦ unFC

instance (Enumerable f, Enumerable g) ⇒ Enumerable (f □ g) where
  enumerate = map FC ◦ enumerate ◦ enumerate

Figure 26. Functor composition

The functor composition operation is especially useful for defining species of graphs and relations. For example, recalling that ℘ = E • E is the species of subsets and E2 is the species of sets restricted to sets of size two, ℘ □ (E2 • E) defines the species of simple graphs. An (E2 • E)-structure is a set of two labels, which we can think of as an undirected edge, and a simple graph is a subset of the set of all possible edges. In fact, many graph-like species can be defined as ℘ □ G for a suitable species G. For example, G = X² • E gives directed graphs without self-loops, and G = ε × ε gives directed graphs with self-loops allowed (recalling that ε = X • E is the species of elements). The reader may enjoy discovering how to represent the species of undirected graphs with self-loops allowed.
6. An embedded language of species

We have defined a type corresponding to each primitive species and species operation, but we would also like to be able to write down and compute with species expressions at the term level. The perfect way to do this is with a type class defining a domain-specific language of species expressions. The expressions can then be interpreted in different ways (for example, as exponential generating functions, cycle index series, abstract syntax trees, or random generation routines) depending on the types at which they are instantiated. The basic Species type class, as defined in the species library, is shown in Figure 27. The actual Species class contains additional methods, but this is the core essence.

class (Differential.C s) ⇒ Species s where
  singleton :: s
  set       :: s
  cycle     :: s
  (◦)       :: s → s → s
  (×)       :: s → s → s
  (□)       :: s → s → s
  ofSize    :: s → (Z → Bool) → s

Figure 27. The Species type class

Some things may seem to be missing (0, 1, sum and product, differentiation) but these are actually provided by the Differential.C constraint (from the numeric-prelude package), which ensures that species are a differentiable ring. The remainder of the class requires primitive singleton, set, and cycle species; composition (◦), cartesian product (×), and functor composition (□) operations; and a cardinality-restricting operator ofSize. It is not hard to put together the Enumerable instances we have already seen into code which enumerates all the labeled structures of a given species. The user can then call the enumerate method on an expression of type Species s ⇒ s, along with some labels to use:

-- cycles of lists
> enumerate (cycle ‘o‘ (nonEmpty linOrd)) "abc" :: [Comp Cycle [] Char]
[<"cba">,<"cab">,<"bca">,<"bac">,<"acb">,
 <"abc">,<"a","cb">,<"a","bc">,<"ca","b">,
 <"ac","b">,<"ba","c">,<"ab","c">,
 <"b","a","c">,<"a","b","c">]

-- simple graphs on three vertices
> enumerate (subset @@ ksubset 2) [1,2,3] :: [Comp Set Set Int]
[{},{{1,2}},{{1,3}},{{1,3},{1,2}},{{2,3}},
 {{2,3},{1,2}},{{2,3},{1,3}},
 {{2,3},{1,3},{1,2}}]
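In the same style, the species of permutations from Section 5 can be enumerated directly. The following is only a sketch of the expected behavior, assuming that cycle needs no explicit nonEmpty wrapper (since cycles are nonempty by convention); E ◦ C-structures on three labels are exactly the permutations of those labels, so we expect 3! = 6 of them:

-- permutations as sets of disjoint cycles
> length (enumerate (set ‘o‘ cycle) [1,2,3] :: [Comp Set Cycle Int])
6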
Since the species library is able to automatically generate Species expressions representing any user-defined data type, we can also enumerate values of user-defined data types, such as Family Int:

> enumerate family [1,2] :: [Family Int]
[ Group 2 [Group 1 []], Group 2 [Monkey True 1]
, Group 2 [Monkey False 1]
, Group 1 [Group 2 []], Group 1 [Monkey True 2]
, Group 1 [Monkey False 2]]

We can also control how the enumeration happens by explicitly specifying the species to use for the Family type, rather than using the default.

7. Generating functions and counting

What else can we do with combinatorial species? A key element of the story we haven't seen yet is the correspondence between species and generating functions. Generating functions are an indispensable tool in combinatorics, and have a well-developed theory [23]. Much of their power lies in the surprising fact that many natural power series operations (addition, multiplication, substitution...) have natural combinatorial interpretations (disjoint union, independent choice, partition...). Every species can be associated with several different generating functions, each of which encodes certain aggregate information about the species. For example, we can associate to each species F an exponential generating function (egf) of the form

F(x) = ∑_{n≥0} fₙ xⁿ/n!,

where fₙ is the number of distinct labeled F-structures of size n. (Note that x is a purely formal parameter, and we need not concern ourselves with convergence; for a more detailed explanation of generating functions, see Wilf [23].) Thus we have, for example,

• 0(x) = 0/0! x⁰ + 0/1! x¹ + ··· = 0,
• 1(x) = 1,
• X(x) = x,
• L(x) = ∑_{n≥0} (n!/n!) xⁿ = 1/(1 − x),
• E(x) = ∑_{n≥0} (1/n!) xⁿ = eˣ,
• C(x) = ∑_{n≥1} ((n − 1)!/n!) xⁿ = −log(1 − x).

(Here is another good reason to call the species of sets E!) At first glance this may seem arbitrary, but quite the opposite is true: species sum, product, composition, and differentiation correspond precisely to the same operations on formal power series! For example, if F(x) and G(x) count the number of labeled F- and G-structures as defined above, it is easy to see that F(x) + G(x) counts the number of labeled (F + G)-structures in the same way, since every (F + G)-structure is either an F-structure or a G-structure (with a tag). And although it is not as immediately apparent, we can verify that F(x)G(x) = (F • G)(x) as well, writing C(n, k) for the binomial coefficient n!/(k!(n − k)!):

F(x)G(x) = (∑_{n≥0} fₙ xⁿ/n!) (∑_{n≥0} gₙ xⁿ/n!)
         = ∑_{n≥0} ∑_{k=0}^{n} (fₖ xᵏ/k!) (g_{n−k} x^{n−k}/(n − k)!)
         = ∑_{n≥0} ∑_{k=0}^{n} fₖ g_{n−k} xⁿ/(k!(n − k)!)
         = ∑_{n≥0} (∑_{k=0}^{n} C(n, k) fₖ g_{n−k}) xⁿ/n!

The expression in the outermost parentheses is precisely the number of labeled (F • G)-structures on a label set of size n: for each k from 0 to n, there are C(n, k) ways to pick k of the n labels to put in the F-structure, fₖ ways to create an F-structure from them, and g_{n−k} ways to create a G-structure from the remaining labels. The reader may also enjoy working out why species differentiation corresponds to exponential generating function differentiation.
156
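This product formula translates directly into code. Below is a minimal sketch in plain ASCII Haskell—not the library's EGF type, which appears shortly—representing an egf as an infinite lazy list of coefficients fₙ, with sum and product implemented exactly as derived above:

import Data.List (genericIndex)

-- an egf given by its coefficients f_0, f_1, f_2, ... of x^n/n!;
-- the coefficient lists are assumed infinite (pad with zeros otherwise)
newtype EGF = EGF [Integer]

addEGF :: EGF -> EGF -> EGF
addEGF (EGF fs) (EGF gs) = EGF (zipWith (+) fs gs)

-- the n-th coefficient is sum_{k=0..n} C(n,k) * f_k * g_{n-k}
mulEGF :: EGF -> EGF -> EGF
mulEGF (EGF fs) (EGF gs) =
  EGF [ sum [ binom n k * genericIndex fs k * genericIndex gs (n - k)
            | k <- [0 .. n] ]
      | n <- [0 ..] ]
  where binom n k = fact n `div` (fact k * fact (n - k))
        fact m    = product [1 .. m]

-- the species L of lists: n! labeled structures of each size
listEGF :: EGF
listEGF = EGF (scanl (*) 1 [1 ..])

For example, the first coefficients of mulEGF listEGF listEGF are [1, 2, 6, 24, ...], counting (L • L)-structures—the same (n + 1)! one-hole lists observed in the differentiation example.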
Seeing the correspondence between species composition and egf substitution takes more work; the interested reader should look up Faà di Bruno's formula. There are also generating function operations corresponding to Cartesian product and functor composition. Although they are not as natural as the other operations, they are simple to define and easy to compute. As a result, we can count labeled structures by interpreting species expressions as exponential generating functions, conveniently represented by infinite lazy lists of coefficients. In fact, this particular technique of counting labeled structures has been known in the functional programming community for some time [19, 21]. The species library defines an EGF type with an appropriate Species instance, and a labeled function to extract the coefficients from an egf. For example:

> take 10 . labeled $ 3 + x*x
[3,0,2,0,0,0,0,0,0,0]
> take 10 . labeled $ cycle ‘o‘ (nonEmpty linOrd)
[0,1,3,14,90,744,7560,91440,1285200,20603520]

Thus, there are no C ◦ L+ structures of size 0, one with a single label, three with two labels, 14 of size three, and so on. The library can even compute generating functions for some recursively defined species, using a quadratically converging combinatorial analogue of the Newton-Raphson method. For example, once Dorothy has used the species library to derive all the appropriate instances for her Family type via Template Haskell, she can use the labeled function to count them:

> take 10 . labeled $ family
[0,3,6,72,1368,36000,1213920,49956480,2427788160,136075645440]

To each species we can also associate an ordinary generating function (ogf) and a cycle index series; the first counts unlabeled structures (shapes), and the second is a generalization of both exponential and ordinary generating functions which also keeps track of symmetries. There is not space to describe them here, but more information can be found in Bergeron et al. [4] or in the documentation for the species library, which includes facilities for computing with all three types of generating functions.
8. Extensions and applications

It should come as no surprise that we have barely scratched the surface; the theory of combinatorial species is both rich and deep. In closing, we will look at some extensions to the theory discussed here which may lead to a deeper understanding of functional programming and data types, as well as potential applications of the theory. At the time of writing, the species library does not include support for any of these extensions, but their inclusion is planned for future releases.

8.1 Extensions

Weighted species  Assigning weights to the structures built by a species allows us to count and enumerate structures in much more refined ways. For example, we can define the species of binary trees, weighted by the number of leaves, and then easily count or enumerate only those trees with a certain number of leaves.

Multisort species  We have only considered species which map a single set of labels to a set of structures, corresponding to type constructors of a single argument. However, all of the theory generalizes straightforwardly to multisort species, which build structures from multiple sets of labels (sorts). For example, multisort species are exactly what we need to make the implicit species theorem precise. First, we can write an implicit equation for F in the form F = H(X, F), where H is a two-sort species. For example, if H(X, Y) = 1 + X • Y then L = H(X, L) is the implicit equation defining the species of lists. The necessary conditions for the implicit species theorem to apply can now be stated precisely:

• H(0, 0) = 0 (no structures on the empty label set)
• (∂H/∂Y)(0, 0) = 0 (F does not trivially reduce to itself)

Extending the species library to handle general multisort species presents an interesting challenge, due to Haskell's lack of kind polymorphism, and is a topic for further research.

Virtual species  It is possible to complete the semiring of species to a ring in a way analogous to the set-theoretic completion of the natural numbers to the integers. We consider pairs of species (F, G) where F is considered "positive" and G "negative"; more precisely, we define an equivalence relation on pairs of species such that (F, G) ∼ (H, K) if and only if F + K is isomorphic to G + H, and define virtual species as the equivalence classes of this relation. Virtual species allow us to define a multiplicative inverse for the species E of sets, and from there to define multiplicative and compositional inverses for other suitable species, solve differential equations, and define a combinatorial logarithm which generalizes the notion of structures built from connected components. Virtual species allow us to give a sensible and consistent meaning to equations like L = 1/(1 − X).

Molecular species

Definition 6. A species F is molecular if all F-structures are isomorphic (i.e. relabelings of one another).

For example, the species X² of ordered pairs is molecular, since we can go from any ordered pair to any other by relabeling. On the other hand, the species L of linear orderings is not molecular, since any two list structures of different lengths are fundamentally non-isomorphic. We have the following three beautiful facts:

• The molecular species are precisely those that cannot be decomposed as the sum of two nonzero species.

• Every molecular species is equivalent to Xⁿ "quotiented by some symmetries"; in particular, the molecular species of size n are in one-to-one correspondence with the conjugacy classes of subgroups of Sₙ. This gives us a way to completely classify molecular species and to compute with them directly. For example, there are four conjugacy classes of subgroups of S₃, each representing a different symmetry on three locations: the trivial subgroup corresponds to X³ itself (no symmetry); swapping two locations yields X • E₂; cycling the locations yields C₃; and identifying all the locations yields E₃.

• Every species can be written uniquely (up to isomorphism and reordering of terms) as a sum of molecular species. This, combined with the previous fact, immediately gives us a complete classification of all combinatorial species. It also provides a method for finding canonical representatives of virtual species: given a pair (F, G), decompose each into a sum of molecular species and cancel any that occur in both F and G. As a corollary, we can always detect when a species that "looks" virtual is actually non-virtual, such as L − 1.
Now we see why species with no nontrivial symmetries can always be built from 1, X, +, and •: any species with no symmetries must be isomorphic to a sum of molecular species with no symmetries; but molecular species with no symmetries must be of the form Xⁿ. Hence regular species are always of the form n₀ + n₁X + n₂X² + ... with nᵢ ∈ ℕ. Adding a fixed point operator allows us to write down certain infinite such sums using only finite expressions.

8.2 Applications

Automated testing  One interesting application is to use species expressions as input to a test-generator-generator, for either random [9] or exhaustive [22] testing. In fact, Canou and Darrasse [7] have already created a library for random test generation in OCaml based on the ideas of combinatorial species. There has also been some interesting recent work by Duregård on automatic derivation of QuickCheck generators for algebraic data types [11], and by Bernardy et al. on using parametricity to improve random test generation [5]; combining these approaches with insights from the theory of species seems promising.

Language design  What if we had a programming language that actually allowed us to declare non-regular data types? What would such a language look like? Could it be made practical? Carette and Uszkay [8] have explored this question by creating a Haskell library allowing the user to program with species. Abbott et al. have explored a similar question from a more theoretical point of view, with their more general notion of quotient containers [3]. More work needs to be done to explain the precise relationship between containers and species, and to transfer these approaches into practical technology available to programmers.

Acknowledgments

I would like to thank the anonymous reviewers for many detailed and helpful comments, and Jeremy Gibbons for the initial encouragement a year ago to write this paper. This work was partially supported by the National Science Foundation, under grant 0910786 TRELLYS.

References

[1] Michael Abbott, Thorsten Altenkirch, and Neil Ghani. Categories of containers. In Foundations of Software Science and Computation Structures, pages 23–38, 2003.
[2] Michael Abbott, Thorsten Altenkirch, Neil Ghani, and Conor McBride. Derivatives of containers. In Typed Lambda Calculi and Applications, TLCA, volume 2701 of LNCS. Springer-Verlag, 2003.
[3] Michael Abbott, Thorsten Altenkirch, Neil Ghani, and Conor McBride. Constructing polymorphic programs with quotient types. In 7th International Conference on Mathematics of Program Construction (MPC 2004), volume 3125 of LNCS. Springer-Verlag, 2004.
[4] F. Bergeron, G. Labelle, and P. Leroux. Combinatorial Species and Tree-like Structures. Number 67 in Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 1998.
[5] Jean-Philippe Bernardy, Patrik Jansson, and Koen Claessen. Testing polymorphic properties. In ESOP 2010: Proceedings of the 19th European Symposium on Programming, pages 125–144, London, UK, 2010. Springer-Verlag.
[6] Bird and Meertens. Nested datatypes. In MPC: 4th International Conference on Mathematics of Program Construction, LNCS. Springer-Verlag, 1998.
[7] Benjamin Canou and Alexis Darrasse. Fast and sound random generation for automated testing and benchmarking in Objective Caml. In ML '09: Proceedings of the 2009 ACM SIGPLAN workshop on ML, pages 61–70, New York, NY, USA, 2009. ACM.
[8] Jacques Carette and Gordon Uszkay. Species: making analytic functors practical for functional programming. Available at http://www.cas.mcmaster.ca/~carette/species/, 2008.
[9] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In ICFP '00: Proceedings of the fifth ACM SIGPLAN international conference on Functional programming, pages 268–279, New York, NY, USA, 2000. ACM.
[10] Duncan Coutts, Isaac Potoczny-Jones, and Don Stewart. Haskell: batteries included. In Haskell '08: Proceedings of the first ACM SIGPLAN symposium on Haskell, pages 125–126, New York, NY, USA, 2008. ACM.
[11] Jonas Almström Duregård. AGATA: Random generation of test data. Master's thesis, Chalmers University of Technology, December 2009.
[12] P. Flajolet, B. Salvy, and P. Zimmermann. Lambda-upsilon-omega: The 1989 cookbook. Technical Report 1073, Institut National de Recherche en Informatique et en Automatique, August 1989. 116 pages.
[13] Philippe Flajolet and Bruno Salvy. Computer algebra libraries for combinatorial structures. Journal of Symbolic Computation, 20(5-6):653–671, 1995.
[14] Gérard Huet. Functional pearl: The zipper. Journal of Functional Programming, 7(5):549–554, 1997.
[15] C. Barry Jay and J. Robin B. Cockett. Shapely types and shape polymorphism. In ESOP '94: Proceedings of the 5th European Symposium on Programming, pages 302–316, London, UK, 1994. Springer-Verlag.
[16] André Joyal. Une théorie combinatoire des séries formelles. Advances in Mathematics, 42(1):1–82, 1981.
[17] Conor McBride. The derivative of a regular type is its type of one-hole contexts. Available at http://www.cs.nott.ac.uk/~ctm/diff.ps.gz, 2001.
[18] Conor McBride. Clowns to the left of me, jokers to the right (pearl): dissecting data structures. In Proceedings of the 35th annual ACM SIGPLAN-SIGACT symposium on Principles of Programming Languages, pages 287–295, San Francisco, California, USA, 2008. ACM.
[19] M. Douglas McIlroy. Power series, power serious. Journal of Functional Programming, 9(3):325–337, 1999.
[20] Peter Morris, Thorsten Altenkirch, and Conor McBride. Exploring the regular tree types. 2004.
[21] Dan Piponi. A small combinatorial library, November 2007. http://blog.sigfpe.com/2007/11/small-combinatorial-library.html.
[22] Colin Runciman, Matthew Naylor, and Fredrik Lindblad. SmallCheck and Lazy SmallCheck: automatic exhaustive testing for small values. In Haskell '08: Proceedings of the first ACM SIGPLAN symposium on Haskell, pages 37–48, New York, NY, USA, 2008. ACM.
[23] Herbert S. Wilf. Generatingfunctionology. Academic Press, 1990.