    parBuffer :: Int -> Strategy a -> Strategy [a]
    parBuffer n s = evalBuffer n (rpar `dot` s)

5.2 Clustering

When tuning the performance of parallel programs it is often important to increase the size of parallel computations, i.e. to use a coarser granularity, in order to achieve a better ratio of computation versus coordination costs. Implementations often contain mechanisms to automatically use coarser granularity on loaded processors. The scenario of fizzling sparks discussed in Section 3.1 is such an example, because the work of a spark is performed by an already running computation. However, further improvements can be obtained by explicitly controlling thread granularity, and in the context of the original strategies we developed a range of clustering techniques [Loidl et al. 2001]. This section adapts these techniques for the new strategies and extends them.

One way to obtain a coarser granularity is to collect computations on related elements of a data structure into "clusters." To this end, we define a class Cluster containing cluster and decluster methods, as well as a method lift that turns an operation over the original data structure into one over such a clustered data structure.

    class (Traversable c, Monoid a) => Cluster a c where
      cluster   :: Int -> a -> c a
      decluster :: c a -> a
      lift      :: (a -> b) -> c a -> c b

      lift      = fmap  -- c is a Functor, via Traversable
      decluster = fold  -- c is Foldable, via Traversable
      -- we require: decluster . cluster n == id

By assuming the Traversable and Monoid contexts we get several operations for free. Through the implicit Functor context, we can use fmap to lift an operation over the base type to one in the cluster type. And through the Monoid and Foldable contexts (the latter implicit), we can use fold as the default for decluster, provided it is an inverse of cluster.

As an example we provide an instance for lists, clustered into lists of lists. Notably, we only have to provide a definition for the cluster method.

    instance Cluster [a] [] where
      cluster _ [] = []
      cluster n xs = ys : cluster n zs
        where (ys,zs) = splitAt n xs

We aim to define a strategy combinator

    evalCluster :: Cluster a c => Int -> Strategy a -> Strategy a

However, the cluster type c occurs only in the constraint, so callers would have no way to determine it; we therefore add a proxy argument of type w c whose only purpose is to fix c:

    evalCluster :: forall a c w . Cluster a c
                => w c -> Int -> Strategy a -> Strategy a
    evalCluster _ n s x = return (decluster (cluster n x `using` cs))
      where cs = evalTraversable s :: Strategy (c a)

Thanks to the Traversable context (inherited from Cluster), we can lift the strategy s to a strategy cs which is applicable to the clustered input. Note that the type annotation in the where clause necessitates the explicit forall in the signature.

With this infrastructure we can define a generic parMapCluster, a variant of parMap performing implicit clustering (based on the Cluster class) behind the scenes.

    parMapCluster :: forall a b c w . Cluster [b] c
                  => w c -> Int -> Strategy b
                  -> (a -> b) -> [a] -> [b]
    parMapCluster _ n s f xs =
      map f xs `using` evalCluster (__ :: w c) n (rpar `dot` evalList s)

Observe how a type annotation is used to emulate passing the (wrapped) cluster type c as a "type argument" to evalCluster; the double underscore __ is short for the bottom value undefined.

To improve readability, instead of wrapping the type argument with fresh type variables, we can use a properly named phantom type:

    data ClusterWith :: (* -> *) -> *

Now it is intuitive that parMapCluster (__ :: ClusterWith []) uses lists for clustering.
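As a usage sketch (ours, not from the paper, assuming the definitions above are in scope): mapping a function over a long list with a hypothetical cluster size of 500, so each spark evaluates a whole sub-list rather than a single element.

    -- Hedged example; the cluster size 500 and the workload are
    -- illustrative assumptions, not values from the paper.
    squares :: [Int]
    squares = parMapCluster (__ :: ClusterWith []) 500 rdeepseq (^ 2) [1 .. 100000]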
5.3 A Divide-and-conquer Pattern

One of the main strengths of strategies is the possibility of constructing abstractions over patterns of parallel computation. Thereby all code specifying the coordination of the program is confined to the pattern. Concrete applications can then instantiate the function parameters to get parallel execution for free. Such patterns are commonly known as algorithmic skeletons [Cole 1989].

As an example we give the implementation of a divide-and-conquer pattern. It is parametrised by a function that specifies the operation to be applied on atomic arguments (f), a function to (potentially) divide the argument into two smaller values (divide), and a function to combine the results from the recursive calls (conquer). Additionally, we provide a function threshold that is used to limit the amount of parallelism, by using a sequential strategy for arguments below the threshold.

    divConq :: (a -> b)            -- compute the result
            -> a                   -- the value
            -> (a -> Bool)         -- par threshold reached?
            -> (b -> b -> b)       -- combine results
            -> (a -> Maybe (a,a))  -- divide
            -> b
    divConq f arg threshold conquer divide = go arg
      where
        go arg =
          case divide arg of
            Nothing      -> f arg
            Just (l0,r0) -> conquer l1 r1 `using` strat
              where
                l1 = go l0
                r1 = go r0
                strat x = do r l1; r r1; return x
                  where r | threshold arg = rseq
                          | otherwise     = rpar

All coordination aspects of the function are encoded in the strategy strat, which describes how the two subcomputations l1 and r1 should be evaluated. The thresholding predicate threshold provided by the caller places a bound on the depth of parallelism, and this is used by strat to decide whether to spark both l1 and r1 or to evaluate them directly. The definition of divConq achieves separation between the specifications of algorithm and parallelism, the latter being confined entirely to the definition of strat.
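For illustration, a hedged instantiation of the pattern (ours, not from the paper): counting the leaves of the naive Fibonacci call tree, sparking subtrees only for arguments at or above an illustrative threshold of 20.

    -- Hedged example of instantiating divConq. Arguments below 20
    -- are evaluated sequentially via rseq; larger ones are sparked.
    fibLeaves :: Int -> Integer
    fibLeaves n =
      divConq (const 1)                 -- f: each leaf counts 1
              n                         -- the value
              (< 20)                    -- threshold: below 20, go sequential
              (+)                       -- conquer: add the two subtree counts
              (\k -> if k < 2
                       then Nothing     -- atomic argument: stop dividing
                       else Just (k - 1, k - 2))

The result is the number of leaves of the call tree; all parallel coordination comes from strat inside divConq, leaving the arguments purely algorithmic.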
5.4 Improving Safety

The original strategy type a -> () embodies the key modularity goal of separating computation and coordination. As any original strategy can only ever return (), it can never change the result of a computation, up to divergence. Unfortunately, the new strategy type gives up this type safety. Strategies of the new type a -> Eval a should be identity functions, i.e. only evaluate their argument but never change its value; we term this property identity safety. However, the type system cannot enforce this behaviour, and it is all too easy to accidentally write flawed strategies, for instance:

    x:xs `using` \ _ -> parList rdeepseq xs

The intention of the programmer is to evaluate the tail of the list in parallel when the list is demanded. The strategy will do that, but it then returns only the tail of the list. We propose a way of trading expressiveness for type-checked identity safety. For this purpose, the module SafeStrategies (not currently distributed with the parallel package) clones the functionality and interface of Strategies, except for wrapping the strategy type with a newtype and providing an explicit strategy application operator.

    newtype Strategy a = MkStrat (a -> Eval a)

    ($$) :: Strategy a -> a -> Eval a
    (MkStrat strat) $$ x = strat x

By hiding the constructor MkStrat when importing the module SafeStrategies, programmers can choose to treat Strategy as an abstract type, thereby restricting themselves to use only strategies constructed by the predefined and trusted (identity safe) strategy combinators (like evalList and evalTraversable). Since MkStrat is not available, the type checker will prevent programmers from "hand-rolling" their own strategies (e.g. the flawed strategy above will be rejected), thereby eliminating the danger of accidentally violating identity safety.

Yet, programmers can still use the Eval monad freely. For instance, the non-modular example of task parallelism from Section 4.6 can be ported to SafeStrategies by inserting $$ after rpar and rseq. Of course, careless use of rpar may cause space leaks or lost parallelism, depending on the GC policy (Section 3), but that is a lesser concern than identity safety because it does not compromise functional correctness.

Why does SafeStrategies export the constructor MkStrat at all, rather than making Strategy a proper abstract type? The reason is that programmers who do need to "hand-roll" their own strategies may want to wrap them in MkStrat. Thus, MkStrat marks the pieces of code where programmers incur "proof obligations" to establish identity safety.
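A hedged sketch of that porting step (our example; f and g are placeholder functions, and rpar and rseq here are the SafeStrategies versions of type Strategy a):

    -- Task parallelism in the Eval monad with explicit strategy
    -- application via $$, as described above.
    parPair :: (a -> b) -> (a -> c) -> a -> (b, c)
    parPair f g x = runEval $ do
      b <- rpar $$ f x   -- spark the first computation
      c <- rseq $$ g x   -- evaluate the second to WHNF
      return (b, c)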
6. Evaluation

This section discusses our measurements in detail, but first we summarise the key results:

• For all programs, the speedup and runtime results with original and new strategies are very similar, giving us confidence that they specify the same parallel coordination for a range of programs and parallel paradigms (Figure 2).

• The speedups achieved with the new strategies are slightly better compared to those with the original strategies: a mean of 4.96 versus 4.83 on 7 cores across all applications (columns 3 & 2 of Table 2).

• The new strategies fix the space leak outlined in Section 3, reducing heap residency on a single core by 56.43% across all applications, and better support speculative parallelism (Section 6.4).

• The overheads of the new strategies are low: mean sequential run-time overhead is 3.84% (Table 1), and memory overheads are low for most programs (columns 8–11 of Table 2).

6.1 Apparatus

Our measurements are made on an eight-core HP XW6600 Workstation with 8GB RAM, comprising two Intel Xeon 5410 quad-core processors, each running at 2.33GHz with a 6MB L2 cache. The benchmarks run under Linux Fedora 7 using a recent GHC development snapshot (6.13 as of 20.5.2010), and parallel packages 1.1.0.1 and 3.1.0.0 for the original and new strategies, respectively. The data points reported are the median of 3 executions. We measure up to 7 cores, as measurements on the eighth core are known to introduce some variability.

Our benchmarks are 10 parallel applications from a range of application areas; 8 of these have been taken from existing benchmark suites [Aswad et al. 2009; Loidl et al. 1999; Marlow et al. 2009] and 2 benchmarks, Coins and TransClos, have been developed afresh with the new strategies module. The programs are the computational kernels of realistic applications, cover a variety of parallel paradigms, and employ several important parallel programming techniques, such as thresholding to limit the amount of parallelism generated, and clustering to obtain coarser thread granularity.

Genetic aligns RNA sequences, using divide-and-conquer parallelism and nested data parallelism. MiniMax performs an alpha-beta search in a 2-player game tree, using divide-and-conquer parallelism and exploiting laziness to prune unnecessary subtrees. Queens solves the n-queens problem, using divide-and-conquer parallelism with an explicit threshold. LinSolv finds an exact solution to a set of linear equations, employing the data parallel multiple homomorphic images approach. Hidden performs hidden-line removal in 3D rendering and uses data parallelism via the parList strategy. Maze searches for a path in a 2D maze and uses speculative data parallelism. Sphere is a ray-tracer from the Haskell nofib suite, using nested data parallelism implemented as parMaps. TransClos finds all elements that are reachable via a given relation from a given set of seed values, using a parBuffer strategy over an infinite list. Coins computes the number of ways of paying a given value from a given set of coins, using divide-and-conquer parallelism. MatMult performs matrix multiplication using data parallelism with implicit clustering.
6.2 Sequential Overhead

Table 1 shows the sequential runtime as baseline, and the difference of the 1-processor runtime with both original and new strategies. For the new strategies, we encounter a runtime overhead of at most +18.14%, for the MatMult program, which is mainly due to the additional work in performing clustering. For all other programs the strategy overhead is significantly lower. Notably, the data parallel programs have a fairly low overhead, despite the additional traversal of a data structure to expose parallelism. Comparing the geometric means of the runtime overheads imposed by both strategy versions, we encounter only a slightly higher overhead for the new strategies: +3.84% compared to +3.21% with the original strategies. This justifies the new strategy approach of high-level generic abstractions.

                 Seq. Runtime        ∆ Time (%)
    Program        (seconds)    Original       New   Paradigm
    LinSolv           23.40        +0.90     +1.97   Nested Data par
    TransClos         83.12        +0.77     +2.24   Data par
    Sphere            21.11        +4.78     +3.32   Nested Data par
    MiniMax           36.98        +0.87     +3.22   D&C
    Coins             42.49        +1.11     +2.12   D&C
    Queens            25.51        +1.37     +6.12   D&C
    MatMult           18.85       +16.87    +18.14   Data par
    Genetic           33.46        +2.96     +3.97   D&C Data par
    Hidden             4.61        +5.86     +2.17   Data par
    Maze              40.93        -2.22     -3.59   Nested Data par
    Geom. Mean                     +3.21     +3.84

    Table 1. Sequential runtime overheads.
6.3 Parallel Performance

Speedups: Figure 2 compares the absolute speedup curves (i.e. speedup relative to sequential runtime) for the applications with the original and new strategies. Both the runtime curves (not reported here) and the speedup curves for the original and new strategies are very similar. The pattern is repeated in more detailed analysis, e.g. in columns 2 and 3 of Table 2. We conclude that the original and new strategies specify the same parallel coordination for a variety of programs representing a range of parallel paradigms and several tuning techniques.

Performance: Table 2 analyses in detail the speedups, number of sparks and memory consumption of all applications, running on 7 cores of an 8-core machine with the original strategies and the new strategies, always using a ROOT garbage collection policy. The number of generated sparks was in all cases virtually identical between original and new strategies, giving us further confidence that the two formulations express the same parallelism. Small differences in the number of generated sparks arise because GHC has a non-deterministic execution model in which a particular expression may be evaluated multiple times at runtime [Harris et al. 2005].

In the cases where the new strategies exhibit poorer performance, the reduction in speedup is still very small: from 5.67 to 5.48 in the worst case, for MiniMax. This reflects the low overhead associated with the new strategies, quantified in the previous subsection.

Interestingly, the performance of the new strategies in the Queens and Sphere programs is better than with the original strategies. Examining the heap consumption reveals that with the new strategies the heap residency is significantly reduced: −24.11% for Queens and −14.53% for Sphere. This results in a lower total garbage collection time, which contributes about half of the reduction in runtime. The reduction in residency is accounted for by the improved space behaviour of the new strategies: the space retained by superfluous sparks is being reclaimed.

Granularity improvement: The comparison of generated versus converted sparks in Table 2 demonstrates the runtime system's effective handling of potential parallelism (sparks). Even when an excessive number of sparks is generated, for example in Coins, the runtime system converts only a small fraction of these sparks. As with any divide-and-conquer program, a thread generated for a computation close to the root will itself evaluate potential child computations, causing their corresponding sparks to fizzle. Hence the granularity of the generated sparks is automatically made coarser, reducing overheads, as can be seen from the speedups achieved. In general, the new strategies provide more opportunities for sparks to fizzle, as discussed in Section 3. This shows up in a lower number of converted sparks for all divide-and-conquer and nested data parallel programs. For single-level data parallelism, as in Sphere, where sparks never share graph structures, there is little or no reduction in the number of converted sparks.

6.4 Memory Management

Fixing the space leak: The new strategies fix the space leak outlined in Section 3. For example, for the parallel raytracer that exhibited the space leak (http://hackage.haskell.org/trac/ghc/ticket/2185) with the original strategies, the heap residency drops from 1.6GB to 5.8MB with the new strategies on 1 core, and the runtime correspondingly drops by about a factor of 3. Comparing single-core executions for all benchmark programs shows a mean reduction in residency of 56.43% with the new strategies. However, for multiple cores the heap measurements in Table 2 do not show a consistent reduction in residency for the new strategies. A number of factors contribute to the observed behaviour here:

• With parallel processors available, garbage sparks tend to be evaluated by other cores and hence fizzle, avoiding the space leak (but wasting some cycles).

• Parallel evaluation itself tends to change the residency profile, in most cases increasing the residency compared to sequential execution.

• Residency is recorded by sampling, and hence the measured value is noisy.

In the case of MatMult, heap residency roughly doubles with the new strategies. This is due to the new strategies composing the clusters to return the result value. The original strategies only use the clusters to express parallelism, but do not compose them into the final result. Despite the higher residency, however, the new strategies achieve a better speedup.

Speculation: To assess the effectiveness of the garbage collection policies ROOT and WEAK, described in Section 3, for managing speculation, we use a program that applies drop to a parallelised list computing the number of primes up to a given value, thereby rendering the sparks on the dropped list elements speculative:

    sum $ drop ((m1-m0) `quot` 2) $
      ([ length (primes n) | n <- [m0..m1] ] `using` parList rdeepseq)

With the WEAK policy almost all sparks of the original strategies are discarded, as expected. With the new strategies 3404 out of 10001 sparks are converted, 32% fewer than with the ROOT policy, although this value varies considerably between executions. Most importantly, the WEAK policy prunes 4796 sparks, almost all of the 5000 speculative sparks. In contrast, the ROOT policy prunes only 3193 sparks, all of them due to fizzling.

Only two of our application kernels use speculation: MiniMax and Maze. In the case of MiniMax the WEAK policy significantly reduces the variation of residencies over the number of cores, and in a 7-core execution residency is reduced by 83.4%. In the case of Maze residency remains unchanged. In both cases the speedup improves only slightly when employing a WEAK policy. Of course, the very inability to reclaim speculative sparks with the ROOT policy discouraged any applications from using them on a larger scale.
Figure 2. Speedup graphs of the application kernels with original and new strategies. (Two panels, original and new strategies, each plotting speedup against number of cores, 1-7, for LinSolv, TransClos, Sphere, MiniMax, Coins, Queens, MatMult, Genetic, Hidden and Maze.)
                 Speedup     Generated Sparks   Converted Sparks    Allocated Heap      Maximum Residency
    Program     Orig   New     Orig      New      Orig     New    Orig (MB)   New ∆%   Orig (KB)   New ∆%
    LinSolv     6.59  6.44      7562     7562      7562    7562     6050.10    +0.15     7104.70    +3.87
    TransClos   6.04  5.81      1041     1041      1041    1040    80174.60    +0.07      108.60    +1.47
    Sphere      4.95  5.67       160      160       160     160     8636.40    -1.14   120943.30   -14.53
    MiniMax     5.67  5.48      1464     1464      1464     163    30476.85    -0.01       98.05    -7.17
    Coins       5.61  5.53    145925   146853      2702    1060    79833.20    +1.59      302.10   +20.36
    Queens      4.58  5.49      1589     1563      1589     636    14903.30   -17.52    19134.50   -24.11
    MatMult     5.04  5.39       100      100       100     100      109.00    -6.97    12272.80  +102.04
    Genetic     4.95  5.02       659      674       659     166    12180.20    -6.75      493.90    +7.88
    Hidden      4.66  4.66       324      324       324     324     4805.50    -0.01     2349.80    -0.44
    Maze        2.05  2.01      2723     2835      2525     481   194122.00    +7.74       71.20   -33.15
    Geom. Mean  4.83  4.96                                                     -2.51               +1.03

    Table 2. Speedups, number of sparks and heap consumption on 7 cores.
7. Related Work

Parallel functional languages [Hammond and Michaelson 1999] typically embed high-level coordination languages into high-level computation languages. A range of high-level coordination models have been used [Trinder et al. 2002], and this section relates the semi-explicit approach adopted by evaluation strategies to other approaches.

Entirely implicit coordination aims to minimise programmer input, typically using either profiling as in [Harris and Singh 2007] or parallel iteration as in [Grelck and Scholz 2003]. Few entirely implicit approaches other than parallel iteration have delivered acceptable performance [Nikhil and Arvind 2001]. Evaluation strategies provide more general parallel coordination than loop parallelism.

Skeleton-based coordination, for instance [Loogen et al. 2005] or [Scaife et al. 2005], is popular in both imperative and functional languages, and exploits a small set of predefined skeletons. Each skeleton is a polymorphic higher-order function describing a common coordination pattern with an efficient parallel implementation [Cole 1989]. As polymorphic higher-order functions, evaluation strategies are similar to skeletons, but there are some key differences. Rather than offering a fixed set of skeletons, evaluation strategies are readily combined to form new strategies. Moreover, where skeletons are parametrised with computational arguments, a strategy is typically applied to a computation.

Data parallel coordination, as in [Blelloch 1996] or as implemented in Data Parallel Haskell [Chakravarty et al. 2007], supports the parallel evaluation of every element in a collection. This is a good match for Haskell's powerful constructs for bulk data types, in particular lists. Data parallelism is often more implicit than evaluation strategies: the programmer simply identifies the collections to be evaluated in parallel. Strategies are more general in that they can express both control parallelism and data parallelism, although in terms of performance Data Parallel Haskell is designed to compile parallel programs down to highly optimised low-level loops over arrays, and hence should achieve significantly better absolute performance on data-parallel programs than would be possible using strategies.

Recent work by [Prabhu et al. 2010] has shown that parallelism by speculating on future data dependencies can be provided as a safe (correctness-preserving) abstraction to programmers. As one might expect, their approach translates naturally into Haskell using par. This approach to speculation is complementary to the speculative parallelism afforded by strategies.
8. Conclusion

The original strategies were developed in 1996 for Haskell 1.2, i.e. before monads, and using a compiler with relatively tame optimisations. The context for the new strategies is radically different. Monads, supported by rich libraries and syntactic sugar like do-notation, are now the preferred mechanism for sequencing computations, and are familiar to the rapidly growing Haskell user community. Applicative functors elegantly encode data structure traversals. Finally, the aggressive use of optimisations in mature Haskell implementations like GHC makes bespoke efficiency specialisations unnecessary.

The new strategy formulation capitalises on improved Haskell idioms and implementations to provide a modular and compositional notation for specifying pure deterministic parallelism. While it has some minor drawbacks (being relatively complex, providing relatively weak type safety, and requiring care to express control parallelism), the advantages are many and substantial. It provides clear, generic, and efficient specification of parallelism with low runtime overheads. It resolves a subtle space management issue associated with parallelism, better supports speculation, and is able to directly express parallelism embedded within lazy data structures. The new strategies are available as part of the Haskell parallel package (since version 3); additional code and benchmarks can be downloaded from http://www.macs.hw.ac.uk/~dsg/gph/papers/abstracts/new-strategies.html. We plan to further enhance and formalise the identity safety of the new strategies, following the direction discussed in Section 5.4. Moreover, the genericity of the new strategies could be improved by automatically deriving instances of the NFData class.

Acknowledgments

Thanks to Greg Michaelson, Simon Peyton Jones and the anonymous referees for constructive feedback. This research is supported by the EPSRC HPC-GAP (EP/G05553X) and EU FP6 SCIEnce (RII3-CT-2005-026133) projects.

References

M. K. Aswad, P. W. Trinder, A. D. Al Zain, G. J. Michaelson, and J. Berthold. Low pain vs no pain multi-core Haskells. In TFP '09 — Draft Proc. of Symp. on Trends in Functional Programming, pages 112–130, Komarno, Slovakia, June 2009.

C. Baker-Finch, D. J. King, and P. Trinder. An operational semantics for parallel lazy evaluation. In ICFP '00 — Intl. Conf. on Functional Programming, pages 162–173, Montreal, Canada, Sept. 2000. ACM Press.

G. E. Blelloch. Programming parallel algorithms. Commun. ACM, 39(3):85–97, 1996.

M. M. T. Chakravarty, R. Leshchinskiy, S. Peyton Jones, G. Keller, and S. Marlow. Data parallel Haskell: a status report. In DAMP '07 — Workshop on Declarative Aspects of Multicore Programming, pages 10–18, Nice, France, Jan. 2007. ACM Press.

M. I. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press & Pitman, 1989.

C. Grelck and S.-B. Scholz. SAC — from high-level programming with arrays to efficient parallel execution. Parallel Processing Letters, 13(3):401–412, 2003.

K. Hammond and G. Michaelson, editors. Research Directions in Parallel Functional Programming. Springer, 1999.

T. Harris and S. Singh. Feedback directed implicit parallelism. In ICFP '07 — Intl. Conf. on Functional Programming, pages 251–264, Freiburg, Germany, Oct. 2007. ACM Press.

T. Harris, S. Marlow, and S. Peyton Jones. Haskell on a shared-memory multiprocessor. In Haskell '05 — Workshop on Haskell, pages 49–61, Tallinn, Estonia, Sept. 2005. ACM Press.

D. Jones Jr., S. Marlow, and S. Singh. Parallel performance tuning for Haskell. In Haskell '09 — Symposium on Haskell, pages 81–92, Edinburgh, Scotland, Sept. 2009. ACM Press.

H.-W. Loidl, P. W. Trinder, K. Hammond, S. B. Junaidu, R. G. Morgan, and S. L. Peyton Jones. Engineering parallel symbolic programs in GpH. Concurrency — Practice and Experience, 11(12):701–752, 1999.

H.-W. Loidl, P. W. Trinder, and C. Butz. Tuning task granularity and data locality of data parallel GpH programs. Parallel Processing Letters, 11(4):471–486, 2001.

R. Loogen, Y. Ortega-Mallén, and R. Peña-Marí. Parallel functional programming in Eden. J. Funct. Program., 15(3):431–475, 2005.

S. Marlow, S. Peyton Jones, and S. Singh. Runtime support for multicore Haskell. In ICFP '09 — Intl. Conf. on Functional Programming, pages 65–78, Edinburgh, Scotland, Sept. 2009. ACM Press.

C. McBride and R. Paterson. Applicative programming with effects. J. Funct. Program., 18(1):1–13, 2008.

E. Mohr, D. A. Kranz, and R. H. Halstead Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Trans. Parallel Distrib. Syst., 2(3):264–280, 1991.

R. S. Nikhil and Arvind. Implicit Parallel Programming in pH. Morgan Kaufmann Publishers, 2001.

P. Prabhu, G. Ramalingam, and K. Vaswani. Safe programmable speculative parallelism. In PLDI '10 — Conf. on Programming Language Design and Implementation, pages 50–61, Toronto, Ontario, Canada, June 2010. ACM Press.

N. Scaife, S. Horiguchi, G. Michaelson, and P. Bristow. A parallel SML compiler based on algorithmic skeletons. J. Funct. Program., 15(4):615–650, 2005.

P. W. Trinder, K. Hammond, H.-W. Loidl, and S. L. Peyton Jones. Algorithms + strategy = parallelism. J. Funct. Program., 8(1):23–60, 1998.

P. W. Trinder, H.-W. Loidl, and R. F. Pointon. Parallel and distributed Haskells. J. Funct. Program., 12(4&5):469–510, 2002. Special Issue on Haskell.
Scalable I/O Event Handling for GHC

Bryan O'Sullivan (Serpentine)    Johan Tibell (Google)

Abstract

We have developed a new, portable I/O event manager for the Glasgow Haskell Compiler (GHC) that scales to the needs of modern server applications. Our new code is transparently available to existing Haskell applications. Performance at lower concurrency levels is comparable with the existing implementation. We support millions of concurrent network connections, with millions of active timeouts, from a single multithreaded program, levels far beyond those achievable with the current I/O manager. In addition, we provide a public API to developers who need to create event-driven network applications.
Categories and Subject Descriptors D.3.2 [Programming Languages]: Language Classifications—Applicative (functional) languages; Concurrent, distributed and parallel languages; D.3.3 [Programming Languages]: Language Constructs and Features—Concurrent programming structures; D.3.4 [Programming Languages]: Processors—Runtime environments

General Terms Algorithms, Languages, Performance

1. Introduction

The concurrent computing model used by most Haskell programs has been largely stable for almost 15 years [10]. Despite the language's many innovations in other areas, networked software is written in Haskell using a programming model that will be familiar to most programmers: a thread of control synchronously sends and receives data over a network connection. By synchronous, we mean that when a thread attempts to send data over a network connection, its continued execution will be blocked if the data cannot immediately be either sent or buffered by the underlying operating system.

The Glasgow Haskell Compiler (GHC) provides an environment with a number of attractive features for the development of networked applications. It provides composable synchronization primitives that are easy to use [3]; lightweight threads; and multicore support [2]. However, the increasing demands of large-scale networked software have outstripped the capabilities of crucial components of GHC's runtime system. We have rewritten GHC's event and timeout handling subsystems to be dramatically more efficient. With our changes, a modestly configured server can easily cope with networking workloads that are several orders of magnitude more demanding than before. Our new code is designed to accommodate both the thread-based programming model of Concurrent Haskell (with no changes to existing application code) and the needs of event-driven applications.

2. Background

2.1 The GHC concurrent runtime

GHC provides a multicore runtime system that uses a small number of operating system (OS) threads to manage the execution of a potentially much larger number of lightweight Haskell threads [6]. The number of operating system threads to use may be chosen at program startup time, with typical values ranging up to the number of CPU cores available. (GHC also provides an "unthreaded" runtime, which does not support multiple CPU cores; we are concerned only with the threaded runtime.)

From the programmer's perspective, programming in Concurrent Haskell is appealing due to the simplicity of the synchronous model. The fact that Haskell threads are lightweight, and do not have a one-to-one mapping to OS threads, complicates the implementation of the runtime system. When a Haskell thread must block, this cannot lead to an OS-level thread also being blocked, so the runtime system uses a single OS-level I/O event manager thread (which is allowed to block) to provide an event notification mechanism. The standard Haskell file and network I/O libraries are written to cooperate with the I/O event manager thread. When one of these libraries acquires a resource such as a file or a network socket, it immediately tells the OS to access the resource in a non-blocking fashion. When a client attempts to access (e.g. read or write, send or receive) such a resource, the library performs the following actions:

1. Attempt to perform the operation. If it succeeds, resume immediately.

2. If the operation would need to block, the OS will instead cause it to fail and indicate (via EAGAIN or EWOULDBLOCK in Unix parlance) that it must be retried later.

3. The thread registers with the I/O event manager to be awoken when the operation can be completed without blocking. The sleeping and waking are performed using the lightweight MVar synchronization mechanism of Concurrent Haskell.

4. Once the I/O event manager wakes the thread, return to step 1. (The operation may fail repeatedly with a would-block error, e.g. due to a lost race against another thread for resources, or an OS buffer filling up.)

As this sketch indicates, GHC provides a synchronous programming model using a lower-level event-oriented mechanism. It does so via a semi-public API that clients (e.g. the file and networking libraries) can use to provide blocking semantics:
    -- Block the current thread until data is available
    -- on the given file descriptor.
    threadWaitRead, threadWaitWrite :: Fd -> IO ()
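The following is a hedged sketch (ours, not the libraries' actual code) of the retry loop from steps 1-4 above, wrapping a raw non-blocking read with threadWaitRead:

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Control.Concurrent (threadWaitRead)
    import Foreign.C.Error (eAGAIN, eWOULDBLOCK, getErrno, throwErrno)
    import Foreign.C.Types (CChar, CInt (..), CSize (..), CSsize (..))
    import Foreign.Ptr (Ptr)
    import System.Posix.Types (Fd (..))

    foreign import ccall unsafe "read"
      c_read :: CInt -> Ptr CChar -> CSize -> IO CSsize

    -- Step 1: try the operation. Step 2: the OS reports EAGAIN or
    -- EWOULDBLOCK. Step 3: sleep via the I/O event manager.
    -- Step 4: retry from the top.
    readWhenReady :: Fd -> Ptr CChar -> CSize -> IO CSsize
    readWhenReady fd@(Fd raw) buf len = go
      where
        go = do
          n <- c_read raw buf len
          if n >= 0
            then return n
            else do
              errno <- getErrno
              if errno == eAGAIN || errno == eWOULDBLOCK
                then threadWaitRead fd >> go
                else throwErrno "readWhenReady"  -- a genuine error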
2.2 Timeout management and robust networking

Well designed network applications make careful use of timeouts to provide robustness in the face of a number of challenges. At internet scale, broken and malicious clients are widespread. As an example, a defensively written application will, if a newly connected client doesn't send any data within a typically brief time window, unilaterally close the connection and clean up its resources. To support this style of programming, the System.Timeout module provides a timeout function:

    timeout :: Int -> IO a -> IO (Maybe a)

It initiates an IO action, and if the action completes within the specified time limit, returns Just its result; otherwise it aborts the action and returns Nothing. Concurrent Haskell also provides a threadDelay function that blocks the execution of a thread for a specified amount of time. Behind the scenes, the I/O event manager thread maintains a queue of pending timeouts. When a timeout fires, it wakes the appropriate application thread.
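For example (a hedged sketch; getLine stands in for reading from a client socket), dropping a client that sends nothing within 30 seconds:

    import System.Timeout (timeout)

    main :: IO ()
    main = do
      -- timeout takes microseconds; 30 * 1000000 = 30 seconds.
      mline <- timeout (30 * 1000000) getLine
      case mline of
        Nothing   -> putStrLn "client too slow; closing connection"
        Just line -> putStrLn ("received: " ++ line)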
3. Related work

Li and Zdancewic [9] began the push for higher concurrency in Haskell server applications with an application-level library that provides both event- and thread-based interfaces. We followed their lead in supporting both event-based and thread-based concurrency, but unlike their work, ours transparently benefits existing Haskell applications. In the context of the Java Virtual Machine, Haller and Odersky unify event- and thread-based concurrency via a Scala implementation of the actor concurrency model [1]. Much of their work is concerned with safely implementing lightweight threads via continuations on top of Java's platform-level threads, resulting in an environment similar to the two-level threading of GHC's runtime, with comparable concurrency management facilities. For several years, C programmers concerned with client concurrency have enjoyed the libev and libevent libraries. These enable an event- and callback-driven style of development that can achieve high levels of both performance and concurrency. Similar frameworks are available in other languages, e.g. Twisted for Python and Node.js for Javascript.

4. Shortcomings of the traditional I/O manager

Although the I/O manager in versions of GHC up to 6.12 is portable, stable, and performs well for low-concurrency applications, its imperfections make it inapplicable to the scale of operations required by modern networked applications. The I/O manager uses the venerable select system call for two purposes. It informs the OS of the resources it wishes to track for events, and of the time until the next pending timeout should be triggered, and it blocks until either an event occurs or the timeout fires. The select system call has well-known problems. Most obvious is the distressingly small fixed limit on the number of resources it can handle, even under modern operating systems, e.g. 1,024 on Linux. In addition, the programming style enforced by select can be inefficient. The sizes of its programmer-visible data structures are linear in the number of resources to watch. They must be filled out, copied twice across the user/kernel address space boundary, and checked afresh for every invocation. Since the common case for server-side applications on the public Internet is for most connections to be idle, the amount of useful work performed per call to select dwindles as the number of open connections increases. This repetitious book-keeping rapidly becomes a noticeable source of overhead.

The I/O manager incurs further inefficiency by using ordinary Haskell lists to manage both events and timeouts. It has to walk the list of timeouts once per iteration of its main loop, to figure out whether any threads must be woken and when the next timeout expires. It must walk the list of events twice per iteration: once to fill out the data structures to pass to select, and again after select has returned, to see which threads to wake. Since select imposes such a small limit on the number of resources it can manage, we cannot easily illustrate the cost of using lists to manage events, but in section 9.2 we will demonstrate the clear importance of using a more efficient data structure for managing timeouts.

5. Our approach

When we set out to improve the performance of GHC's I/O manager, our primary goal was to increase the number of files, network connections, and timeouts GHC could manage by several orders of magnitude. We wanted to achieve this in the framework of the existing Concurrent Haskell model, retaining complete source-level compatibility with existing Haskell code, and in a manner that could be integrated into the main GHC distribution with minimal effort. Secondarily, we wanted to sidestep the long dispute over whether events or threads make a better programming model for high-concurrency servers [11]. Since we needed to implement an event-driven I/O event manager in order to provide synchronous semantics to application programmers, we might as well design the event API cleanly and expose it publicly to those programmers who wish to use events. (In our experience, even in a language with first-class closures and continuations, writing applications of anything beyond modest size in an event-driven style is painful.)

We desired to implement as much as possible of the new I/O event manager in Haskell, rather than delegating to a lower-level language. This wish was partly borne out of pragmatism: we initially thought that it might be more efficient to build on a portable event handling library such as libev or libevent2, but experimentation convinced us that the overhead involved was too high. With performance and aesthetics pushing us in the same direction, we were happy to forge ahead in Haskell.

Architecturally, our new I/O event manager consists of two components. Our event notification library provides a clean and portable API, and abstracts the system-level mechanisms used to provide efficient event notifications (kqueue, epoll, and poll). We have also written a shim that implements the semi-public threadWaitRead and threadWaitWrite interfaces. This means that neither the core file or networking libraries, nor other low-level I/O libraries, require any changes to work with our new code, and they transparently benefit from its performance improvements.

6. Interface to the I/O event manager

Our I/O event manager is divided into a portable front end and a platform-specific back end. The interface to the back end is simple, and is only visible to the front end; it is abstract in the public interface.
    data Backend = forall a. Backend {
        -- State specific to this platform.
        _beState :: !a,

        -- Poll the back end for new events. The callback
        -- provided is invoked once per file descriptor with
        -- new events.
        _bePoll :: a
                -> Timeout                   -- in milliseconds
                -> (Fd -> Events -> IO ())   -- I/O callback
                -> IO (),

        -- Register, modify, or unregister interest in the
        -- given events on the specified file descriptor.
        _beModifyFd :: a
                    -> Fd      -- file descriptor
                    -> Events  -- old events to watch for
                    -> Events  -- new events to watch for
                    -> IO (),

        -- Clean up platform-specific state upon destruction.
        _beDestroy :: a -> IO ()
      }

A particular back end will provide a new action that fills out a Backend structure. For instance, the Mac OS X back end starts out as follows:

    module System.Event.KQueue (new) where

    new :: IO Backend

On a Unix-influenced platform, typically more than one back end will be available. For instance, on Linux, epoll is the most efficient back end, but select and poll are available. On Mac OS X, kqueue is usually preferred, but again select and poll are also available. Our public API thus provides a default back end, but allows a specific back end to be used (e.g. for testing).

    -- Construct the fastest back end for this platform.
    newDefaultBackend :: IO Backend

    newWith :: Backend -> IO EventManager

    new :: IO EventManager
    new = newWith =<< newDefaultBackend

For low-level event-driven applications, a typical event loop involves running a single step through the I/O event manager to check for new events, handling them, doing some other work, and repeating. Our interface to the I/O event manager supports this approach.

    init :: EventManager -> IO ()

    -- Returns an indication of whether the I/O event manager
    -- should continue, and a modified timeout queue.
    step :: EventManager
         -> TimeoutQueue  -- current pending timeouts
         -> IO (Bool, TimeoutQueue)

To register for notification of events on a file descriptor, clients use the registerFd function.

    -- A set of events to wait for.
    newtype Events
    instance Monoid Events

    evtRead, evtWrite :: Events

    -- A synchronous callback into the application.
    type IOCallback = FdKey -> Events -> IO ()

    -- Cookie describing an event registration.
    data FdKey

    registerFd :: EventManager
               -> IOCallback  -- callback to invoke
               -> Fd          -- file descriptor of interest
               -> Events      -- events to listen for
               -> IO FdKey

Because the I/O event manager has to accommodate being invoked from other threads as well as from the same thread in which it is running, registerFd wakes the I/O manager thread when invoked. A client remains registered for notifications until it explicitly drops its registration, and is thus called back on every step into the I/O event manager as long as an event remains pending. We find this level-triggered approach to event notification easier for client applications to use than edge triggering.

    unregisterFd :: EventManager -> FdKey -> IO ()
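Putting the pieces together, here is a hedged sketch of a one-shot read watcher, built only from the API shown above (the released code may differ in detail):

    -- Register interest in readability, and drop the registration
    -- the first time the callback fires. The interface is
    -- level-triggered, so without unregistering we would be called
    -- back on every step while data remains unread.
    watchOnce :: EventManager -> Fd -> IO () -> IO FdKey
    watchOnce mgr fd action = registerFd mgr callback fd evtRead
      where
        callback key _events = do
          unregisterFd mgr key
          action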
7. Implementation

By and large, the story of our efforts revolves around appropriate choices of data structure, with a few extra dashes of context-sensitive and profile-driven optimization thrown in.

7.1 Economical I/O event management

GHC's original I/O manager has to walk the entire list of blocked clients once per loop before calling select, and mutate the list afterwards to wake and filter out any clients that have pending events. A step through the I/O manager's loop thus involves O(n) traversal and mutation, where n is the number of clients. Our new I/O event manager registers file descriptors persistently with the operating system, using epoll on Linux and kqueue on Mac OS X, so the I/O event manager no longer needs to walk through all clients on each step through the loop. Instead, we maintain a finite map from file descriptor to client, which we look up for each triggered event. This map is based on Leijen's implementation of Okasaki and Gill's purely functional Patricia tree [7]. The new I/O event manager's loop thus involves O(m log n) traversal, and negligible mutation, where m is the number of clients with events pending. This works well in the typical case where m ≪ n.
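A toy model of that dispatch structure (our illustration, not the paper's code), using Data.IntMap (GHC's implementation of the same Okasaki and Gill Patricia tree) to map file descriptors to callbacks:

    import qualified Data.IntMap as IM

    type Callbacks = IM.IntMap (IO ())

    -- Only the m descriptors the OS reported ready are looked up,
    -- for O(m log n) work per step, instead of touching all n clients.
    dispatch :: Callbacks -> [Int] -> IO ()
    dispatch cbs ready = mapM_ run ready
      where
        run fd = case IM.lookup fd cbs of
                   Just cb -> cb
                   Nothing -> return ()  -- registration already dropped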
7.2 Cheap timeouts

In the original I/O manager, GHC maintains pending timeouts in an ordered list, which it partly walks and mutates on every iteration. Inserting a new timeout thus has O(n) cost per operation, as does each step through the I/O manager's loop. The I/O event manager needs to perform two operations efficiently during every step: remove all timeouts that have expired, and find the next timeout to wait for. Since we need both efficient update by key and efficient access to the minimum value, we use a priority search queue. Ours is based on that of Hinze [4], so insertion and deletion have O(log n) cost. A step through our new loop has O(m log n) cost, where m is the number of expired timeouts (typically m ≪ n, so we win on performance).
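A toy model of those two operations (ours; Data.Map is a stand-in for Hinze's priority search queue, which additionally supports efficient update keyed by a unique identifier):

    import qualified Data.Map as M

    -- Deadline mapped to the wakeup action to run when it expires.
    type TimeQueue = M.Map Double (IO ())

    -- Remove and return all timeouts that have expired by 'now'.
    expire :: Double -> TimeQueue -> (TimeQueue, [IO ()])
    expire now q = (rest, M.elems ready)
      where (ready, rest) = M.partitionWithKey (\t _ -> t <= now) q

    -- The next deadline, to hand to the back end's poll call.
    nextDeadline :: TimeQueue -> Maybe Double
    nextDeadline q
      | M.null q  = Nothing
      | otherwise = Just (fst (M.findMin q))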
8. War stories, lessons learned, and scars earned

Writing fast networking code is tricky business. We have variously encountered:

• Tunable kernel variables (15 at the last count) that regulate obscure aspects of the networking stack in ways that are important at scale;

• Abstruse kernel infelicities (e.g. Mac OS X lacking the NOTE_EOF argument to kqueue, even though it has been present in other BSD variants since 2003);
• Performance bottlenecks in GHC that required expert diagnosis (section 8.2);

• An inability to stress the software enough, due to lack of 10-gigabit Ethernet hardware (gigabit Ethernet is easily saturated, even with obsolete hardware).

In spite of these difficulties, we are satisfied with the performance we have achieved to date. To give a more nuanced flavour of the sorts of problems we encountered, we have chosen to share a few in more detail.

8.1 Efficiently waking the I/O event manager

In a concurrent application with many threads, the I/O event manager thread spends much of its time blocked, waiting for the operating system to notify it of pending events. A thread that needs to block until it can perform I/O has no way to tell how long the I/O event manager thread may sleep for, so it must wake the I/O event manager in order to ensure that its I/O request can be queued promptly.

The original implementation of event manager wakeup in GHC uses a Unix pipe, which clients use to transmit one of several kinds of single-byte control message to the I/O event manager thread. The delivery of a control message has the side effect of waking the I/O event manager if it is blocked. Because a variety of control message types exist, the original event manager reads and inspects a single byte from the pipe at a time. If several clients attempt to wake the event manager thread before it can service any of their requests, it acts as if it has been woken several times in succession, potentially performing unneeded work. More damagingly, this design is vulnerable to the control pipe filling up, since a Unix pipe has a fixed-size buffer. If control messages are lost due to a pipe overflow, an application may deadlock. (Indeed, one of our microbenchmarks inadvertently provided a demonstration of how easy it was to provoke a deadlock under heavy load!)

As a result, we invested some effort in ameliorating the problem. Our principal observation was that by far the most common control message is a simple "wake up." We have accordingly special-cased the handling of this message. On Linux, when possible, we use the kernel's eventfd facility to provide fast wakeups. No matter how many clients send wakeup requests in between checks by the I/O event manager, it will receive only one notification. While other operating systems do not provide a comparably fast facility, we still have a trick up our sleeves. We dedicate a pipe to delivering only wakeup messages. To issue a wakeup request, a client writes a single byte to this pipe. When the I/O event manager is notified that data is available on this pipe, it issues a single read system call to gather all currently buffered wakeups. It does not need to inspect any of the data it has read, since they must all be wakeups, and the fixed size of the pipe buffer guarantees that it will not be subject to unnecessary wakeups, regardless of the number of clients requesting them. This means that we no longer need to worry about wakeup messages that cannot be written for want of buffer space, so the thread doing the waking can safely use a non-blocking write.
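A hedged sketch of the draining side of that trick (ours, under the assumption that the event manager only calls this after being notified the pipe is readable):

    import Foreign.Marshal.Alloc (allocaBytes)
    import System.Posix.IO (fdReadBuf)
    import System.Posix.Types (Fd)

    -- Gather every buffered wakeup byte with one read; the bytes'
    -- contents are irrelevant, only the act of reading matters.
    drainWakeups :: Fd -> IO ()
    drainWakeups fd = allocaBytes 4096 $ \buf -> do
      _ <- fdReadBuf fd buf 4096
      return ()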
8.2 The great black hole pileup

Our use of an IORef to manage the timeout queue yielded a problem that was especially difficult to diagnose, with a symptom of programs unpredictably running thousands of times slower. In our threadDelay benchmark, thousands of threads compete to update the single timeout management IORef atomically. If one of these threads was pre-empted while evaluating the thunk left in the IORef by atomicModifyIORef, the thunk would become a "black hole," i.e. a closure that is being evaluated. From that point on, all the other threads would become blocked on black holes: as each thread called atomicModifyIORef and found a black hole inside, it would deposit a new black hole that depended on its predecessor. A black hole is a special kind of thunk that is invisible to applications, so we could not play any of the usual seq tricks to jolly evaluation along.

When we encountered this problem, the black hole queue was implemented as a global linear list, which was scanned during every GC. Most of the time, this choice of data structure was not a problem, but it became painful with thousands of threads. In response, Simon Marlow performed a wholesale replacement of GHC's black hole mechanism. Instead of a single global black hole queue, GHC now queues a blocked thread against the closure upon which it is blocking. His work has fixed our problem.

8.3 Bunfight at the GC corral

When a client application registers a new timeout, we must update the data structure that we use to manage timeouts. Originally, we stored the priority search queue inside an IORef, and each client manipulated the queue using atomicModifyIORef. Alas, this led to a bad interaction with GHC's generational garbage collector. Since our client-side use of atomicModifyIORef did not force the evaluation of the data inside the IORef, the IORef would accumulate a chain of thunks. If the I/O event manager thread did not evaluate those thunks promptly enough, they would be promoted to the old generation and become roots for all subsequent minor garbage collections (GCs). When the thunks eventually got evaluated, they would each create a new intermediate queue that immediately became garbage. Since the thunks served as roots until the next major GC, these intermediate queues would get copied unnecessarily in the next minor GC, increasing GC time. We had created a classic instance of the generational "floating garbage" problem.

The effect of the floating garbage problem on performance was substantial. For example, with 20,000 threads sleeping, we saw variations in our threadDelay microbenchmark performance of up to 34%, depending on how we tuned the GC and whether we simply got lucky. We addressed this issue by having clients store a list of edits to the queue, instead of manipulating it directly.

    type TimeoutEdit = TimeoutQueue -> TimeoutQueue

While maintaining a list of edits doesn't eliminate the creation of floating garbage, it reduces the amount of copying at each minor GC enough that these substantial slowdowns no longer occur.
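A hedged sketch of that fix (TimeoutQueue is the abstract queue type from above; the helper names are ours): clients cons an edit closure onto a list, which is O(1) and builds no thunk chain over the queue itself, and the manager thread later drains and applies the edits.

    import Data.IORef

    -- Client side: cheap, contention-friendly update.
    postEdit :: IORef [TimeoutEdit] -> TimeoutEdit -> IO ()
    postEdit ref edit = atomicModifyIORef ref (\edits -> (edit : edits, ()))

    -- Manager side: drain the pending edits and apply them, oldest
    -- first. The list is newest-first, so the right-fold composition
    -- applies the last element (the oldest edit) innermost.
    applyEdits :: IORef [TimeoutEdit] -> TimeoutQueue -> IO TimeoutQueue
    applyEdits ref q = do
      edits <- atomicModifyIORef ref (\pending -> ([], pending))
      return $! foldr (.) id edits q  -- force the queue's root promptly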
9. Empirical results

We gathered Linux results on commodity quad-core server-class hardware with 2.66GHz Intel Xeon X3230 CPUs and 4GB of RAM, running 64-bit Debian 4.0. We used version 6.12.1 of GHC for all measurements, running server applications on three cores with GHC's parallel garbage collector disabled. (The first release of the parallel GC performed poorly on loosely coupled concurrent applications; this problem has since been fixed.) When measuring network application performance, we used an idle gigabit Ethernet network.

9.1 Performance of event notification

To evaluate the raw performance of event notification, we wrote two HTTP servers. Each uses the usual Haskell networking libraries, and we compiled each against both the original I/O manager (labeled "(old)" in graphs) and our rewrite (labeled "(new)"). The first, pong, simply responds immediately to any HTTP request with a response of "Pong!". The second, file, opens and serves the contents of a 4,332-byte file. We used the ApacheBench tool to measure performance while varying client concurrency.

In figure 1, all client connections are active simultaneously; none are idle. Under these conditions of peak load, the epoll back end exhibits throughput and latency comparable to the original I/O manager. Notably, the new I/O event manager handles far more concurrent connections than the 1,016 or so that the original I/O manager is capable of.

To create a workload that corresponds more closely to conditions for real applications, we open a variable number of idle connections to the server, then measure the performance of a series of requests where we always use 64 concurrently active clients. Figure 2 illustrates the effects on throughput and latency of the pong microbenchmark when we vary the number of idle clients. For completeness, we measured the performance of both the epoll and poll back ends. The original and epoll managers show similar performance up to the 1,024 limit that select can handle, but while the performance of poll is erratic, the epoll back end is solid until we have 50,000 idle connections open. (We have tested the new event manager with as many as 300,000 idle client connections.) In general, the small limit that select imposes on the number of concurrently managed resources prevents us from seeing any interesting changes in the behaviour of the original I/O manager, because applications fall over long before any curves have an opportunity to change shape. We find this disappointing, as we were looking forward to a fair fight.
Figure 1. Requests served per second (top) and latency per request (bottom) for the two HTTP server benchmarks (pong and file), with all clients busy, under old and new I/O managers. (Both panels plot against the number of concurrent active clients, on a logarithmic scale.)

Figure 2. Requests served per second (top) and latency per request (bottom), with 64 active connections and varying numbers of idle connections. (Both panels plot against the number of concurrent idle clients, comparing pong under the old manager and the new manager's epoll and poll back ends.)

9.2 Performance of timeout management

We developed a simple microbenchmark to measure the performance of the threadDelay function, and hence the efficiency of the timeout management code. We measured its execution time, with the runtime system set to use two OS-level threads. As the upper graph of figure 3 indicates, GHC's traditional I/O manager exhibits O(n²) behaviour when managing numerous timeouts. In comparison, the lower graph of figure 3 shows that the new timeout management code has no problem coping with millions of simultaneously active timeouts. The performance of our microbenchmark did not begin to degrade until we had three million threads and timeouts active on a system with 4GB of RAM. Even for smaller numbers of threads, the new timeout management code is far more efficient than the old, as figure 4 shows.

Figure 3. Performance of the threadDelay benchmark, run under the existing I/O event manager (top) and our rewritten manager (bottom); both panels plot execution time (secs) against thousands of running threads.

Figure 4. Comparative performance of old and new I/O managers on the threadDelay microbenchmark. Note the logarithmic scale on the y-axis, needed to make the numbers for the new manager distinguishable from zero.
10. Future work

We have integrated our event management code into GHC, and it will be available to all applications as of GHC 6.14. Our future efforts will revolve around Windows support and further performance improvements.

10.1 Windows support

As we are primarily Unix developers, our work to date leaves GHC's event management on Windows unchanged. We believe that our design can accommodate the Windows model of scalable event notification via I/O completion ports.

10.2 Lower overhead

We were a little surprised that epoll is consistently slightly slower than select. This might be in part because we currently issue two epoll_ctl system calls per event notification: one to queue it with the kernel, and one to dequeue it. In contrast, the original I/O manager performs none. If we used epoll in edge-triggered mode, we could eliminate one call to epoll_ctl to dequeue an event.
10.3 Improved scaling to multiple cores

In theory, an application should be able to improve both throughput and latency by distributing its event management load across multiple cores. We already support running many instances of the low-level I/O event manager at once, with each instance managing a disjoint set of files or network connections. We hope to create a benchmark that stresses the I/O event manager in such a way that we can either find bottlenecks in, or demonstrate a performance improvement via, multicore scaling.
Acknowledgments We owe especial gratitude to Simon Marlow for his numerous detailed conversations about performance, and for his heroic fixes to GHC borne of the tricky problems we encountered. We would also like to thank Brian Lewis and Gregory Collins for their early contributions to the new event code base.
Figure 3. Performance of the threadDelay benchmark, run under the existing I/O event manager (top) and our rewritten manager (bottom).
Execution time (secs)
Additional materials
The source code of the original, standalone version of our event management library and our benchmarks are available at http://github.com/tibbe/event . 0
References
100 old new
10
[1] P. Haller and M. Odersky. Actors that unify threads and events. In Proceedings of the International Conference on Coordination Models and Languages, 2007. [2] T. Harris, S. Marlow, and S. Peyton Jones. Haskell on a sharedmemory multiprocessor. In Haskell ’05: Proceedings of the 2005 ACM SIGPLAN workshop on Haskell, pages 49–61. [3] T. Harris, S. Marlow, S. Peyton Jones, and M. Herlihy. Composable memory transactions. In PPoPP ’05: Proceedings of the tenth ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, pages 48–60.
1 0.1 0.01 0
10
20
30
40
50
60
Thousands of running threads
[4] R. Hinze. A simple implementation technique for priority search queues. In Proceedings of the 2001 International Conference on Functional Programming, pages 110–121.
Figure 4. Comparative performance of old and new I/O managers on the threadDelay microbenchmark. Note the logarithmic scale on the y-axis, needed to make the numbers for the new manager distinguishable from zero.
[5] D. Jones Jr., S. Marlow, and S. Singh. Parallel performance tuning for Haskell. In Proceedings of the 2009 Haskell Symposium. [6] S. Marlow, S. Peyton Jones, and W. Thaller. Extending the Haskell foreign function interface with concurrency. In Haskell ’04: Proceedings of the ACM SIGPLAN workshop on Haskell, pages 57–68. URL http: //www.haskell.org/~simonmar/papers/conc-ffi.pdf. [7] C. Okasaki and A. Gill. Fast mergeable integer maps. In Workshop on ML, pages 77–86, 1998. [8] B. O’Sullivan. Criterion, a new benchmarking library for Haskell. http://bit.ly/rUuAa, 2009. [9] L. Peng and S. Zdancewic. Combining events and threads for scalable network services. In PLDI ’07: Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 189–199. [10] S. Peyton Jones, A. Gordon, and S. Finne. Concurrent Haskell. In POPL ’96: Proceedings of the 1996 Annual Symposium on Principles of Programming Languages, pages 295–308.
two epoll ctl system calls per event notification: one to queue it with the kernel, and one to dequeue it. In contrast, the original I/O manager performs none. If we used epoll in edge-triggered mode, we could eliminate one call to epoll ctl to dequeue an event6 . 10.3
Better performance tools
When we were diagnosing performance problems with the I/O event manager, we made heavy use of existing tools, such as the Criterion benchmarking library [8], GHC’s profiling tools, and the ThreadScope event tracing and visualisation tool [5]. As useful as those tools are, when we made our brief foray into multicore event dispatching, we lacked data that could help us to pin down any performance bottleneck. If we could integrate the new Linux perf analysis tools with ThreadScope, we might gain a broader systemic perspective on where performance problems are occurring.
Improved scaling to multiple cores
In theory, an application should be able to improve both throughput and latency by distributing its event management load across multiple cores. We already support running many instances of the low-level I/O event manager at once, with each instance managing a disjoint set of files or network connections. 6 As
a side note, the BSD kqueue mechanism is cleaner than epoll in this one respect, combining queueing, dequeueing, and checking for multiple events into a single system call. However, the smaller number of trips across the user/kernel address space boundary does not appear to result in better performance, and the kqueue mechanism is otherwise more cluttered and difficult to use than epoll.
[11] R. von Behren, J. Condit, and E. Brewer. Why events are a bad idea (for high-concurrency servers). In HotOS IX: 9th Workshop on Hot Topics in Operating Systems, 2003.
108
An LLVM Backend for GHC

David A. Terei        Manuel M. T. Chakravarty
University of New South Wales
{davidt,chak}@cse.unsw.edu.au
Abstract

In the presence of ever-changing computer architectures, high-quality optimising compiler backends are moving targets that require specialist knowledge and sophisticated algorithms. In this paper, we explore a new backend for the Glasgow Haskell Compiler (GHC) that leverages the Low Level Virtual Machine (LLVM), a new breed of compiler written explicitly for use by other compiler writers, not high-level programmers, that promises to enable outsourcing of low-level and architecture-dependent aspects of code generation. We discuss the conceptual challenges and our backend design. We also provide an extensive quantitative evaluation of the performance of the backend and of the code it produces.

Categories and Subject Descriptors: D.3.2 [Language Classifications]: Applicative (functional) languages; D.3.4 [Processors]: Code generation, Retargetable compilers

General Terms: Design, Languages, Performance

1. Introduction

The Glasgow Haskell Compiler (GHC) began with a backend translating code for the Spineless Tagless G-machine (STG-machine) to C [23]. The idea was that targeting C would make the compiler portable due to the ubiquity of C compilers. At the time, this was a popular approach [7, 14, 27]. By leveraging C as a portable assembly language, the authors of compilers for higher-level languages hoped to save development effort, reuse the engineering work invested into the backends of optimising C compilers, and achieve portability across multiple architectures and operating systems.

Unfortunately, it turned out that C is a less than ideal intermediate language, especially for compilers of lazy functional languages with their non-standard control flow [24]. In particular, C does not support proper tail calls, first-class labels, access to the runtime stack for garbage collection, and many other desirable features. This is not surprising, as C was never designed to act as an intermediate language for high-level compilers. Nevertheless, it complicates the work of compiler writers, as they have to work around those limitations of C-based backends. Moreover, the resulting machine code is less efficient than that of backends which generate native assembly code.

GHC and other high-level compilers partially mitigate these shortcomings by targeting the GNU C compiler and using some of its many language extensions, such as global registers and first-class labels. This doesn't detract too much from the portability of C, as GNU C itself has been ported to many architectures. However, exploiting GNU C extensions only partially solves the problems of compiling via C, as C compiler optimisations are often ineffective for code generated from high-level languages: much static information gets lost in the translation. Since the exact set of supported extensions depends on the particular version of GNU C, this approach also introduces new dependencies. In the case of GHC, the use of the GNU C compiler as a backend also increases compilation time significantly. As a response, GHC eventually included support for native code generators that directly produce assembly code. These are currently only fully functional for the x86 and SPARC architectures.

The desire to reap the benefits of compiling via C, while avoiding the problems, inspired the development of low-level intermediate languages that can be conveniently targeted by high-level compilers and that provide the basis for code generators shared by multiple compilers. Of particular interest is C--, as its design has been heavily influenced by the experience with GHC [24]. Although using C-- as an intermediate language is technically a very promising approach, it comes with a huge practical problem: it is only worthwhile to develop a portable compiler backend if it is used by many compilers, but compiler writers do not want to commit to a backend technology unless they know it is widely used and supported. As a consequence, a variant of C-- is currently used as a low-level intermediate language in GHC, but there is no useful general-purpose backend based on C-- that GHC could target.

Currently, the most promising backend framework is the Low Level Virtual Machine (LLVM), which comes with support for just-in-time compilation and life-long program analysis and optimisation [19]. An LLVM-based C compiler, named clang, recently gained significant traction as an alternative to the GNU C compiler.¹ Hence, it is very likely that LLVM will be further developed and is a suitable target for a long-term strategy.

In this paper, we describe the design of a new GHC backend that leverages LLVM. We illustrate the problems that we encountered, such as conflicting register conventions and GHC's tables-next-to-code optimisation, and our approach to solving them. We also present a quantitative analysis of both the performance of the backend itself and of the code it produces. In particular, we compare it to the C backend and the native code generator of GHC. The overall outcome is very promising: the new LLVM backend matches the performance of the existing backends on most code and outperforms the existing backends by up to a factor of 2.8 on tight loops with high register pressure on the x86 architecture. In summary, we make the following technical contributions:

• We qualitatively compare GHC's existing backends and the capabilities of the LLVM framework (Sections 3 & 4).
• We present a design for an LLVM backend for GHC, including new methods to solve long-standing problems, such as GHC's fixed register assignment and tables-next-to-code optimisation (Section 5).
• We present a quantitative analysis of the performance of the two old backends and our new LLVM backend (Section 6).

We discuss related work in Section 7 and conclude in Section 8.

¹ To a large part due to the backing and financial support of Apple.
2. The case for a new backend
As we outlined in the previous section, GHC used to have two backends: (1) the C backend, which generates GNU C code, and (2) the native code generator, which generates assembly code for a few architectures. We will briefly review these two backends before giving our motivation for developing a third, the LLVM backend.
2.1 The C backend
The C backend is based on the STG-machine, an abstract machine that was designed to support the compilation of higher-order, lazy functional languages [23]. GHC's C backend generates C code which contains extensions that are specific to the GNU C compiler. These extensions facilitate the generation of more efficient code by storing the virtual machine registers of the STG-machine in concrete registers of the target hardware, by supporting tail calls with first-class labels, and by removing some overhead due to superfluous function entry and exit code.

The resulting dependence on GNU C has two major drawbacks. Firstly, portability is limited by the portability of GNU C (it is not very well supported on Microsoft Windows, for example), and even on architectures that are supported by GNU C, the generated code can be of poor quality, as is, for example, the code produced by the GNU C backend for the SPARC architecture. Secondly, as GHC exploits not only the extensions but also the particular form of assembly generated, there are also dependencies on the version of the C compiler. Therefore, changes in the code generator of the GNU C compiler can break, and have in the past broken, GHC.

The GHC C code generator consists of a reasonably manageable 5,400 lines of code. However, to achieve even better efficiency than is possible by exploiting GNU C extensions alone, GHC opts to post-process the assembly generated by the C compiler. More precisely, it uses a Perl script to match specific patterns of assembly code and to rewrite them to better-optimised assembly code. This script is of course heavily dependent on the target architecture and also on the specific version of GNU C. In particular, it rearranges code blocks to implement the tables-next-to-code scheme of GHC, which we will discuss in more detail in Section 3.5. For obvious reasons, this script is hard to maintain (it is not a coincidence that it is nicknamed "the evil mangler"), and it has been responsible for more than one tricky bug.

Finally, another serious shortcoming of the C backend is its relatively long compilation time. GHC generates sizeable C files, and GNU C requires considerable time to turn them into assembly code. Unfortunately, the long compilation time does not result in carefully optimised code, as one might hope. On the contrary, the generated assembly leaves much to be desired. This is not so much the fault of the GNU C compiler as a consequence of GHC generating non-idiomatic C code.
2.2 The native code generator

GHC's native code generator (NCG) was developed to avoid the problems of the C backend. It directly generates assembly code. Just as with C code generation, imperative code is generated from a representation of a Haskell program in the language of the STG-machine. As a result, and because appropriate care is taken, code generated by the NCG is binary compatible with code generated by the C backend.

GHC's native code generator shares the usual advantages and disadvantages of a backend that produces assembly. It can generate efficient code, without any of the tricks used by the C backend, such as post-processing the assembly. Implementing an NCG, however, requires detailed knowledge of the target architecture and a considerable amount of work, much of which is replicated for each platform GHC supports. It is also quite difficult to implement code generators that generate high-quality code, as this requires many optimisations and fine-tuning for each architecture to achieve optimal register and instruction selection. With the NCG, each platform, such as x86, SPARC, and PowerPC, has to perform its own code generation, with only the register allocator being shared among all platforms. As a result, the NCG, which includes three code generators (for x86, SPARC, and PowerPC), is, at about 20,570 lines of code, nearly four times the size of the C backend. One of the main advantages of the NCG is that it can generally compile a Haskell program in half the time of the C backend. Also, despite its far larger size compared to the C backend, it is arguably the simpler of the two.

2.3 The LLVM backend

A GHC backend on the basis of a portable compiler framework, such as the Low Level Virtual Machine (LLVM) [19], promises to combine the benefits of the C backend and the NCG with few or none of their disadvantages. The idea behind the C backend was to outsource the considerable development and maintenance effort required to implement a compiler backend to the developers of C compilers; after all, this is an area that is fairly far away from where GHC innovates. Compared to the NCG and C backend, LLVM has the following to offer:

Offloading of work. Building a high-performance compiler backend is a huge amount of work; LLVM, for example, was started around 10 years ago. Going forward, GHC's LLVM backend should be a lot less work to maintain and extend than either the C backend or the NCG. It will also benefit from any future improvements to LLVM, which has a particularly bright-looking future given the community and industrial support behind it [28].

Optimisation passes. GHC does a great job of producing fast Haskell programs. However, there are a large number of lower-level optimisations, particularly the kind that require machine-specific knowledge, that it doesn't currently implement. Some examples of these include partial redundancy elimination (PRE), loop unrolling and strength reduction. Through LLVM we gain these and many more for free.

The LLVM framework. Perhaps the most appealing feature of LLVM is that it has been designed from the start to be a compiler framework. Individual optimisations can be chosen and ordered at compile time, and new optimisation passes can be dynamically loaded. LLVM also offers a choice of register allocators and even an interpreter and JIT compiler. For a research-driven project like GHC this is a great benefit and makes LLVM a very fun and useful tool to experiment with.

The community. The LLVM project now includes far more than LLVM itself: it is an entire compiler tool chain with a C/C++ compiler, assembly tools, a linker, a debugger, and static analysis tools. The work of the community on these projects, and also the work of third-party compilers such as GHC, benefits all compilers based on LLVM and improves tool support.
3. How GHC works
Before discussing LLVM and how it fits into GHC’s compilation process in Sections 4 and 5, this section details the aspects of GHC’s design that are relevant to the LLVM backend.
Figure 1. The GHC pipeline

3.1 The GHC pipeline

Figure 1 outlines GHC's compilation pipeline. The three main intermediate languages in that pipeline are:

Core. GHC's main optimisation engine is implemented in the form of a large number of program analysis and source-to-source transformation steps on the intermediate language Core. Core is a typed lambda calculus, specifically an instance of System FC(X) [26], and as such far removed from the process of target code generation. Hence, it plays no role in the design of the LLVM backend. Nevertheless, it is central to one of the areas of major innovation in GHC, which highlights our previous point that target code generation is essentially an unwelcome distraction to most GHC developers.

STG-language. This is an A-normalised [12] lambda calculus, which serves as the language of the Spineless Tagless G-Machine (STG-machine) [23], the abstract machine that defines GHC's execution model. This execution model was originally the basis of the C backend, and hence strongly impacts a number of the design choices in the target code generation. We will discuss the STG-machine and its impact on code generation in more detail in the following subsection.

Cmm. Cmm is a variant of the C-- language [24], which in turn could be described as a subset of C with extensions to facilitate its use as a low-level intermediate language; for example, it supports tail calls and integration with a garbage collector. As most of the more complex features of C--, such as its runtime system, aren't supported in Cmm, it is even closer to being a subset of C. In fact, before GHC included Cmm, it used an intermediate language called Abstract C instead. Cmm is the starting point for the two original code generators, the C backend and the NCG, and it also serves as the input to our LLVM backend. It is central to developing a GHC backend and we will discuss it in detail in Subsection 3.4.

The dependence of GHC's code generators on Cmm is obvious in Figure 1, where the pipeline splits after Cmm depending on which backend is used. The NCG generates assembly directly from Cmm, whereas the C backend and the new LLVM backend generate C and LLVM IR, respectively. The generation of assembly from C and LLVM IR is then left to supporting tools, namely the GNU C compiler and the LLVM tools, respectively. The C backend additionally runs a post-processing tool, a Perl script with target code-specific regular expressions, over the assembly to further optimise it. We will discuss the exact reasons for starting from Cmm in the LLVM backend in detail in Section 5. Before we can do this, we first need to introduce some of the design of the STG-machine and the Cmm intermediate language.

3.2 Spineless Tagless G-Machine

The STG-machine essentially comprises three parts: (1) the STG-language from Figure 1, (2) an abstract machine configuration consisting of a register set, heap, stack, etc., and (3) an operational semantics that defines in which way the various constructs of the STG-language alter the machine configuration upon execution. The first component, the STG-language itself, is not important for the LLVM backend, as we translate the lower-level Cmm to LLVM IR. However, the remaining two components, the machine configuration and the operational semantics, are crucial to understanding the LLVM backend, as it is the ultimate purpose of the backend to map these two components onto the target machine architecture, or, more precisely, onto the LLVM machine configuration and LLVM IR code, respectively. In theory, it is LLVM's responsibility to map STG-configurations and programs encoded in LLVM IR to the various concrete architectures. In practice, the design of the LLVM backend requires us to understand how LLVM IR maps to concrete architectures to generate efficient code. In particular, we need to represent the heap, stack, and machine registers of the STG-machine on top of LLVM. As Cmm is specialised to GHC and the translation of STG-language programs, the Cmm code follows certain idioms and includes specific language constructs to handle the components of the STG machine configuration. Of particular importance is the register set of the STG-machine, which we will call STG registers in the following.

3.3 STG Registers

Abstract machines usually define the most frequently accessed components of their machine state as abstract machine registers, to suggest that these are mapped to hardware registers for optimal performance. In the case of GHC, these abstract machine registers are also central for the interaction with the runtime system (RTS), which is written in C, Cmm, and some snippets of assembly. The STG registers function as an interface between generated code and the runtime system. In other words, the mapping of STG registers to the hardware registers and memory locations of the target architecture is hard-wired into the runtime system. The STG registers include a stack and heap pointer, as well as a set of general registers that are used for argument passing.

Currently, there are two different ways in which GHC implements STG registers; they are called unregisterised and registerised mode, respectively. Unregisterised mode is the simple approach where all STG registers are stored in memory on the heap. Due to the frequent use of STG registers in GHC-generated code, this simple approach comes with a significant performance penalty and is mainly meant for porting and bootstrapping GHC on a new architecture. In unregisterised mode, GHC's C backend generates standard C code and omits the post-processing phase indicated in Figure 1. In contrast, registerised mode uses the hardware registers of the target architecture to store at least the most important STG registers; this process is often referred to as register pinning. As there are far too many STG machine registers to store them all in
real registers, some still need to be stored in memory. This technique alone has a dramatic effect on the speed of programs. As these registers are used by GHC's C-based runtime system, implementing the STG registers in a different manner than either of the two currently supported ones would involve significant changes to the RTS, increasing the development and maintenance effort. Hence, an appropriate mapping of the STG registers can be a considerable challenge for a backend, since it requires explicit control over register allocation, something not offered by many compiler targets, including LLVM.

3.4 Cmm

As depicted in Figure 1, Cmm is the final backend-independent intermediate representation used by GHC, and serves as a common starting point for the backend code generators. Cmm is based on the C-- language, but with numerous additions and omissions. The most important difference is that Cmm doesn't support any of C--'s runtime interface features. In C--, these features provide support for implementing accurate garbage collection and exception handling. Instead of involving Cmm, GHC uses a portable garbage collector, implemented in the runtime system, that requires no explicit support from the backends, but depends on the idiomatic generation of Cmm code by GHC. Overall, Cmm is designed to be a minimal procedural language. It supports just the features needed to efficiently abstract away hardware, and little more. The prominent features of the language are:

1. Unlimited variables, abstracting real hardware registers;
2. A simple type system of either bit types or float types;
3. A powerful label type and sections, which can be used to implement higher-level data types;
4. Functions and function calling with efficient tail call support. Functions don't support arguments, though; the STG registers are used instead to explicitly implement the calling convention;
5. Explicit control flow, with functions being comprised of blocks and branch statements;
6. Direct memory access;
7. A set of global variables that represent the STG registers; and
8. Code and data order in Cmm is preserved in the compiled code. GHC uses this property for implementing one particular optimisation, which we will examine in detail in the next subsection.

Cmm greatly simplifies the task of a backend code generator, as the non-strict and functional aspects of Haskell have already been handled and the code generators instead only need to deal with a fairly simple procedural language. Figure 2 displays an example of Cmm. It demonstrates a large portion of the Cmm language, such as its types, variables, control structures and use of code and data labels.

    section "data" {
        fibmax: bits32 35;
    }

    fib() {
        bits32 count;
        bits32 n;
        bits32 n1;
        bits32 n2;
        count = R1;
        n2 = 0;
        n1 = 1;
        n = 0;

        if (count == 1 || bits32[fibmax] < count) {
            n = 1;
            goto end;
        }

    for:
        if (count > 1) {
            count = count - 1;
            n = n2 + n1;
            n2 = n1;
            n1 = n;
            goto for;
        }

    end:
        R1 = n;
        jump StgReturn;
    }

Figure 2. Cmm example: Fibonacci numbers

3.5 Cmm data & code layout

One of the requirements Cmm places on a backend is that the generated object code has the same order of data and code sections as the Cmm code. If a data section and a code section are adjacent in the Cmm code, they are expected to be adjacent in the final object code. This is a problematic requirement, as C compilers and other tools take the liberty to reorder code and data sections. Hence, this requirement accounts for much of the magic performed by the Perl script realising the assembly post-processing for GHC's C backend. It turns out to be a problem for the LLVM backend, too.

But before getting into the details of the LLVM backend, let us review the reason for the onerous constraint of the Cmm intermediate language. GHC uses it to implement an optimisation known as tables-next-to-code (TNTC). The basic idea is to lay the metadata of a closure right before the code for the closure itself. The metadata, which we call an info-table, is required by the runtime system for each closure. With that layout, both the closure's evaluation code and its metadata can be accessed from a single pointer. A graphical representation of this layout is in Figure 3. The first word of a closure is its info pointer, which refers to the first instruction of the closure's entry code. The remaining fields of the closure, its payload, contain a function's free variables or a data constructor's arguments.

Figure 3. GHC's optimised TNTC heap layout

A closure's entry code is executed when the closure is evaluated, i.e., when a lazily evaluated piece of code, a thunk, is forced, or when a function closure is entered to apply it to some arguments. The entry code of closures representing data structures that are in normal form returns a value identifying the corresponding data constructor or similar. By indexing backwards from a closure's info pointer, the runtime system can access the info-table, which contains layout information to assist garbage collection, and other metadata.

Without the TNTC optimisation, the first word of a closure does not refer directly to the entry code, but instead to the info-table, as depicted in Figure 4. The info-table, in turn, explicitly stores a pointer to the entry code in an additional field. Hence, without the TNTC optimisation, info-tables use one more word of memory and, more importantly, executing a closure's entry code, when it is evaluated, requires two pointer lookups instead of one. This is costly, as Haskell code creates and evaluates closures at a rapid rate.

Figure 4. GHC's unoptimised heap layout

In summary, due to the frequent closure entry of Haskell code, the GHC designers chose to bake a layout constraint into Cmm that is hard to meet with conventional backend technology, such as compiling via C or using a general-purpose framework such as LLVM. We will look into the capabilities of LLVM in more detail in the following section.
4. How LLVM works

The Low Level Virtual Machine (LLVM) is an open source, mature optimising compiler framework whose development started in 2000 as part of Lattner's Masters thesis [18]. Today, it provides a high-performance static compiler backend, but can also be used to build just-in-time compilers and to provide mid-level analyses and optimisation in a compiler pipeline. Its main innovation is in the area of life-long program analysis and optimisation [19], i.e., it supports program analysis and optimisation at compile time, link time, and runtime. Our GHC LLVM backend currently ignores the link-time and runtime analysis and optimisation capabilities and uses LLVM as a conventional static backend. Hence, we are most concerned with LLVM's abstract machine language, which serves as input to the LLVM pipeline.

4.1 The LLVM assembly language

The LLVM assembly language, LLVM IR, is the input language which LLVM accepts for code generation. However, it also acts as LLVM's internal intermediate representation for program analysis and optimisation passes. The IR has three equivalent representations: a textual representation (the assembly form), an in-memory representation, and a binary representation. The textual representation is useful in a compiler pipeline where individual tools communicate via files, as well as for human inspection. The in-memory representation is used internally, but also whenever a compiler links to LLVM as a library to avoid the overhead of file input and output. The binary representation is used for compact storage; it occupies less storage than the textual format and can be read more efficiently.

The LLVM IR is low-level and assembly-like, but it maintains higher-level static information in the form of type and dataflow information, the latter due to using static single assignment (SSA) form [9]. SSA form guarantees that every variable is only assigned once (and never updated), and is hence strongly related to functional programming [6]. The design goal in combining a low-level language with high-level static information is to retain sufficient static information to enable aggressive optimisation, while still being low-level enough to efficiently support a wide variety of programming languages. The main features of LLVM's assembly language are:

1. Unlimited virtual registers, abstracting real hardware registers;
2. Low-level assembly with higher-level type information;
3. Static single assignment (SSA) form with phi (φ) functions;
4. Functions and function calling with efficient tail call support;
5. Explicit control flow, with functions comprising blocks and branch statements; and
6. Direct memory access, as well as a type-safe address calculation instruction, getelementptr, facilitating optimisations.

The single-assignment property of the SSA form requires the use of phi (φ) functions in the presence of low-level control flow with explicit branches. A phi function selects the value to be assigned to a virtual register depending on the edge of the control-flow graph along which execution reached the phi function. SSA form is well established as a type of intermediate representation that simplifies the implementation of code analysis and optimisation.

The above feature list of the LLVM IR has much in common with the feature list of Cmm (in Section 3.4). We will compare the two in detail in Section 5.1, where we discuss the translation. For the moment, let's have a look at a concrete piece of LLVM IR code.

    define i32 @pow( i32 %M, i32 %N ) {
    LoopHeader:
        br label %Loop
    Loop:
        %res  = phi i32 [1, %LoopHeader], [%res2, %Loop]
        %i    = phi i32 [0, %LoopHeader], [%i2, %Loop]
        %res2 = mul i32 %res, %M
        %i2   = add i32 %i, 1
        %cond = icmp ne i32 %i2, %N
        br i1 %cond, label %Loop, label %Exit
    Exit:
        ret i32 %res2
    }

Figure 5. LLVM code to raise an integer to a power

The code in Figure 5 contains one complete LLVM function, which is made up of a list of basic blocks, each preceded by a label. The function has three basic blocks: LoopHeader, Loop, and Exit. All control flow in LLVM is explicit, so each basic block must end with a branch (br) or return statement (ret). Variable names preceded by a percent symbol, such as %res and %i, denote virtual registers. Virtual registers are introduced by the unique assignment that defines them, just as in a let-binding. All operations are annotated with type information, such as i32, which denotes a 32-bit integer type. Finally, the Loop block starts with two phi functions. The first one assigns to %res either the constant 1 or the value stored in register %res2, depending on whether execution entered the Loop block from LoopHeader or from Loop itself.
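The remark above that SSA form is strongly related to functional programming can be made concrete: Figure 5 transliterates directly into a tail-recursive Haskell function. The following sketch is our own illustration, not code from the paper; the two phi functions become the parameters of the local loop, and every virtual register becomes an immutable binding.

    -- A transliteration of Figure 5 into Haskell (illustrative only).
    -- Each phi node becomes a parameter of the tail-recursive 'loop';
    -- each SSA virtual register becomes an immutable let-binding.
    pow :: Int -> Int -> Int
    pow m n = loop 1 0
      where
        -- 'res' and 'i' play the role of the phi nodes %res and %i: their
        -- values depend on whether we arrive from LoopHeader (the initial
        -- arguments 1 and 0) or from a previous Loop iteration.
        loop res i =
          let res2 = res * m      -- %res2 = mul i32 %res, %M
              i2   = i + 1        -- %i2   = add i32 %i, 1
          in if i2 /= n           -- %cond = icmp ne i32 %i2, %N
               then loop res2 i2  -- br i1 %cond, label %Loop, ...
               else res2          -- ret i32 %res2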
All LLVM code is defined as part of an LLVM module, with modules serving as compilation units. An LLVM module consists of four parts: meta information, external declarations, global variables, and function definitions. Meta information can be used to define the endianness of the module, as well as the alignment and size of various LLVM types for the architecture the code will be compiled to. Global variables are as expected; they are prefixed with the @ symbol, as are functions, to indicate that they are actually pointers to the data and have global scope. This also distinguishes them from local variables, which are prefixed with the % symbol.
4.2 Comparing C-- and LLVM
As mentioned previously, GHC's Cmm language is based on the C-- language. The goals of the C-- project were not unlike those of the LLVM project. There is, however, an important difference between the two: LLVM is geared towards supporting aggressive optimisation of a universal language, and C-- towards supporting high-level language features, such as garbage collection, with no overhead. It is interesting, though, that despite these differences both projects independently developed very similar features. This might suggest that a universal low-level language needs to support a certain essential set of features. It is also interesting to see that over its lifetime, LLVM's design has in some areas moved towards that of C--. A few examples of features that C-- supported in its initial design and that LLVM only added later are:
• LLVM's type system originally was similar to C's, supporting signed and unsigned variations of char, byte, int and long. Now its type system is much closer to C--'s, having simply bitsN types and not distinguishing between signed and unsigned. LLVM also used to exclusively use overloaded operations, such as addition and division, but now increasingly has separate instructions for the different types.
• LLVM at first did not support declaring the calling convention of functions and calls; these declarations were only added later.
• LLVM originally contained malloc and free instructions, but these have very recently been removed.
• LLVM now has direct support for implementing garbage collection. This is not as complex or as versatile as the compile- and runtime interface supported by C--, but it works in a similar manner: frontend compilers annotate their code with safe points and call special functions in their LLVM code that trigger a compiler plug-in, which they need to supply, to add the information needed by their garbage collector to the code. Our backend doesn't use this support, though, as the garbage collector implemented by GHC doesn't require it.
5. LLVM backend design

As shown in Figure 1, our LLVM backend uses Cmm as its input language, just like the other two backends. In principle, we could have used the STG-language as our input language, to try to use its higher-level information to generate better code. However, that would have meant duplicating much of the existing functionality in the STG-to-Cmm phase, which not only deals with sophisticated language features, such as laziness and partial application, but also with runtime system considerations, such as the generation of metadata for the garbage collector. Instead of replicating this functionality, it seems more economical to fix any shortcomings in the Cmm code generator and in the Cmm language if and when we identify a situation where the LLVM backend doesn't have the information it needs. This hasn't happened so far. Moreover, sharing as much code as possible between the backends simplifies maintaining ABI compatibility between the new LLVM backend and the existing backends, which is important to enable linking to modules and libraries compiled with other backends. Finally, there is ongoing work in GHC to move to a new Cmm code generator and a slightly changed Cmm representation [25]. Once complete, this work should improve the code generated by all backends, making it complementary to the LLVM backend instead of competitive.

Despite all backends compiling off Cmm, the design and implementation of a translation phase to LLVM IR raises a number of conceptual problems that are unique to the LLVM backend: (1) the mapping of Cmm language constructs to LLVM IR, (2) the generation of LLVM's SSA form, and especially of the phi functions, (3) an efficient implementation of the STG registers, and (4) the implementation of Cmm's strict code and data layout constraints. We will address these issues individually in the following subsections.

5.1 Compiling Cmm to LLVM IR

The Cmm and LLVM IR were designed with a similar goal in mind: to be a minimal language that abstracts hardware. The primary difference is LLVM's broader focus, aiming to support multiple languages and aggressive optimisation of the code, whereas Cmm, as used in GHC, is skewed towards compiling Haskell-like languages. To support high-level data types, Cmm uses a label system that works like assembly labels. There is no type information, and arrays and record structures are implemented in the same manner. LLVM, in contrast, supports high-level data types such as arrays and structures explicitly. Nevertheless, translating between the two is fairly straightforward, especially since many of the harder cases, such as a Cmm data structure with labels at non-start positions, aren't used by GHC and so don't need to be supported; these features were inherited from C--.

Another minor difference is that LLVM's preferred way of accessing memory is a special instruction, getelementptr, that takes a pointer type, such as an array, and an index, returning a type-safe pointer. In contrast, Cmm uses explicit pointer arithmetic. LLVM supports this, too, but it prevents some worthwhile code optimisations. Initially we simply used pointer arithmetic in the LLVM backend, but we have recently changed to using the getelementptr instruction, primarily as part of some work to give LLVM better aliasing information.

Many other aspects of Cmm and LLVM IR are rather similar, and instead of discussing all features in detail, Table 1 provides a summary of the relationship. As a concrete example, consider the translation of the Cmm code of Figure 2 into unoptimised LLVM code in Figure 6. After improving the code with LLVM's optimiser, the code is more compact, as shown in Figure 7. By relying on the LLVM optimiser, instead of trying to generate better-optimised LLVM code directly, we could keep the LLVM backend simpler; after all, we want to offload as much work as possible onto LLVM.
Basic types.
  Cmm: a fixed set of integer and floating point types: i8, i16, i32, i64, i128; f32, f64, f80, f128.
  LLVM: support for any size-N bit type and a fixed set of floating point types: i1, i2, i3, ..., i32, ..., iN; float, double, x86-fp80, fp128.

High-level types.
  Cmm: supports a label type that represents the address of the location at which it is declared. This can be used to implement higher-level types such as arrays and structures. Array: cmmLabel {i32, i32, i32, i32}. Structure: cmmLabel {i32, f32, f64}.
  LLVM: has explicit support for high-level types such as arrays and structures. Array: [4 x i32]. Structure: {i32, float, double}.

Variables.
  Cmm: unlimited typed local variables. Global data is represented through untyped labels; all load and store operations are instead typed.
  LLVM: unlimited typed local and global variables; however, LLVM's use of SSA form means Cmm local variables don't map directly to LLVM local variables. Stack-allocated variables are used instead.

Code module structure.
  Cmm: a module consists of global variables and functions. Functions don't support arguments or return values; the STG registers are used for this purpose instead. Functions consist of a list of blocks. All control flow between blocks is explicit.
  LLVM: a module consists of global variables, functions, type aliases, metadata and a data layout specification. Functions support arguments and a single return value.

Expressions.
  Cmm: literals, memory reads, machine operations and STG register reads.
  LLVM: literals, memory reads and machine operations. LLVM has full coverage of the machine operations that Cmm supports, and not much more; the mapping is nearly one to one.

Statements.
  Cmm: comments, assignment, memory writes, unconditional branch, conditional branch, multi-way branch, tail calls and function calls. Cmm also supports calls to a group of functions called CallishMachOps. These are maths functions such as sin and log.
  LLVM: directly supports all of the statements that Cmm supports, and a few more. Interestingly, LLVM also supports many of the Cmm CallishMachOp instructions in a similar manner; in LLVM they are termed intrinsic functions. On hardware which supports them they should become CPU instructions; otherwise they are turned into calls to the C standard library.

Table 1. Mapping of Cmm and LLVM languages

    %fibmax_struct = type {i32}
    @fibmax = global %fibmax_struct {i32 35}

    define cc 10 void @fib( i32 %Base_Arg, i32 %Sp_Arg,
                            i32 %Hp_Arg, i32 %R1_Arg ) {
    cP:
        [...]
        %R1_Var = alloca i32, i32 1
        store i32 %R1_Arg, i32* %R1_Var
        %cf = alloca i32, i32 1
        %cl = alloca i32, i32 1
        %ck = alloca i32, i32 1
        %cj = alloca i32, i32 1
        %nQ = load i32* %R1_Var
        store i32 %nQ, i32* %cf
        store i32 0, i32* %cl
        store i32 1, i32* %ck
        store i32 0, i32* %cj
        %nR = load i32* %cf
        %nS = icmp eq i32 %nR, 1
        br i1 %nS, label %cT, label %nU
    nU:
        %nV = bitcast %fibmax_struct* @fibmax to i32*
        %nX = load i32* %nV
        %nY = load i32* %cf
        %nZ = icmp ult i32 %nX, %nY
        br i1 %nZ, label %cT, label %n10
    n10:
        [...]

Figure 6. Partial output of the LLVM backend with Figure 2 as input

    %fibmax_struct = type { i32 }
    @fibmax = global %fibmax_struct { i32 35 }

    define cc 10 void @fib( i32 %Base_Arg, i32 %Sp_Arg,
                            i32 %Hp_Arg, i32 %R1_Arg ) {
    cP:
        %nS = icmp eq i32 %R1_Arg, 1
        br i1 %nS, label %c12, label %nU
    nU:
        %nX = load i32* getelementptr inbounds
                  (%fibmax_struct* @fibmax, i32 0, i32 0)
        %nZ = icmp ult i32 %nX, %R1_Arg
        br i1 %nZ, label %c12, label %c13.preheader
    c13.preheader:
        [...]
    }

Figure 7. Output of the LLVM optimiser with Figure 6 as input

5.2 Dealing with LLVM SSA form

As we discussed in Section 4, LLVM code must be in SSA form, i.e., all LLVM virtual registers are immutable, single-assignment variables. In contrast, all of Cmm's variables are mutable, so we need to handle the conversion to SSA form as part of the LLVM backend. The conversion of arbitrary code into SSA form is a well-understood problem; however, it requires a fair amount of implementation work. Thankfully, LLVM provides us with an alternative option that is far simpler: instead of LLVM's virtual registers, we can use stack-allocated variables. We translate each mutable Cmm variable into an LLVM variable allocated on the stack using the alloca instruction. This instruction returns a pointer into the stack that can be read from and written to just like any other memory location in LLVM, using explicit load and store instructions. Code using stack-allocated variables instead of virtual registers is generally slower, but LLVM includes an optimisation pass called mem2reg, which is designed to correct this. This pass turns explicit stack allocation into the use of virtual registers, in a manner that is compatible with the SSA restriction, by using phi functions. In effect, mem2reg implements the SSA conversion for us.
5.3 Handling the STG registers

The efficient treatment of the STG registers was one of the major challenges we faced in writing the LLVM backend. While the LLVM backend could easily implement the STG registers using unregisterised mode, where they are all stored in memory, this would lead to poor performance. Hence, we need to support registerised mode and map as many of the STG registers as possible to hardware registers, and we want to do that such that it yields the same register mapping as used by the other two backends. This is crucial for ABI compatibility, as discussed.

Register mapping is a straightforward affair for the NCG, as it has full control over register allocation. GHC's NCG register allocator is aware of the special status of the STG registers and simply reserves the appropriate hardware registers for exclusive use as STG registers; i.e., it aliases the STG registers with some hardware registers. The situation is similar for the C backend. Although ANSI C does not offer control over hardware-specific registers, GNU C provides an extension, called global register variables, which facilitates the same approach of reserving fixed hardware registers for specific STG registers throughout the code.

LLVM does not provide this option. Instead, our solution is to implement a new calling convention for LLVM that passes the first n arguments of a function call in specific hardware registers: the hardware registers that we would like to associate with STG registers. The LLVM backend then compiles each Cmm function such that the corresponding LLVM function uses the new calling convention with the appropriate number of parameters. Furthermore, the generated LLVM code passes the correct STG registers as the first n arguments to each call. As a consequence, the values of the STG registers are in the appropriate hardware registers on entry to any function. This is in contrast to the other two backends, where the STG registers are also pinned to their hardware registers throughout the body of a function. However, to guarantee the correct register assignment at function entry, it is sufficient to ensure that the runtime system finds the registers in the correct place and that LLVM code can call and be called from code generated by other backends. In fact, it is an improvement over the strategy of the other two backends, as the n hardware registers can temporarily be used for other purposes in function bodies if LLVM's register allocator decides it is worthwhile spilling them. In most cases, though, simply leaving the STG registers in the hardware registers is the best allocation, and LLVM is capable of recognising this.

The only downside of a new calling convention is that its addition requires modifying the LLVM source code. However, our extension has recently been accepted upstream by the LLVM developers. It is included in public versions of LLVM since version 2.7, which was released in May of 2010; i.e., as long as version 2.7 or later is being used, GHC works with a standard LLVM installation.

5.4 Handling Cmm data and code layout

As discussed in Section 3.5, Cmm's requirement to preserve the ordering of data and code layout is uncommon and causes problems for backends other than the NCG (which generates the assembly sections explicitly). Unfortunately, GHC uses this property of Cmm to implement the tables-next-to-code optimisation, which is fairly significant, giving a roughly 5% reduction in runtimes. As the C backend can't meet the layout requirement with either ANSI C or one of the GNU C extensions, it resorts to an extra pass over the assembly code produced by the GNU C compiler to rearrange assembly sections and rewrite the code. The LLVM backend faces the same problem as the C backend, as there is no explicit support for ordering code sections in LLVM.

Fortunately, we found a technique to realise the ordering constraint by using the subsections feature of the GNU Assembler (gas). This feature facilitates the specification of a numbered subsection whenever assembly is placed into a particular assembly section. When gas compiles the assembly to object code, it combines subsections in ascending numerical order and creates only one section including all the numbered subsections. In other words, the subsections are purely a structure that exists in the assembly; they do not appear explicitly in the object code. To guarantee that a closure's metadata appears immediately before the closure's entry code, we simply place the metadata in text subsection ⟨n⟩ and the entry code in text subsection ⟨n + 1⟩, making sure that no other code or functions use those subsections. This is illustrated by the example in Figure 8.

    .text 12
    sJ8_info:
        movl $base_SystemziIO_hPrint2_closure, (%ebp)
        movl $base_GHCHandleziFD_stdout_closure, -4(%ebp)
        addl $-4, %ebp
        jmp base_GHCziIOziHandleziText_hPutChar1_info

    .text 11
    sJ8_info_itable:
        .long Main_main1_srt-sJ8_info
        .long 0
        .long 327712

Figure 8. GNU Assembler subsections used to implement TNTC

While this approach works well, it does create a portability problem, as it only works with the GNU Assembler. Fortunately, this covers two of the three major platforms GHC is supported on: Linux and Windows. On Mac OS X, though, we are unable to use this technique, and so for now we have resorted to post-processing the assembly produced by LLVM in a manner similar to the C backend. While this is regrettable, it is important to note that the mangler² used by the LLVM backend consists of only about 180 lines of Haskell code, of which half is documentation. The C mangler, by comparison, is around 2,000 lines of Perl code, as it has to handle multiple platforms and far more than simple assembly code rearrangement. For the future, we are planning to move to a purely LLVM-based solution by extending LLVM to explicitly support associating a global variable with a function. This approach might enable better code optimisation, for example, by performing global constant propagation with the info-table values.

² We are tentatively calling this pass the Righteous Mangler, in line with the established naming convention.

6. Evaluation of the LLVM backend

Next, we evaluate the new LLVM backend in comparison to GHC's existing C backend and NCG. The evaluation is in two parts: first, we consider the complexity of the backends themselves, and second, we analyse the performance of the generated code. The backend complexity is a primary concern for GHC developers, whereas code performance concerns both developers and users of GHC.
6.1 Complexity of backend implementations

As a simple metric for the code complexity of the three backends, we compare their respective code sizes. This gives us an indication of the amount of work initially required to implement them, as well as the effort spent on maintenance. Table 2 displays the code size of the various components of the three backends. The LLVM backend is the smallest, at 3,133 lines of code. The C backend is over 70% larger, at 5,382 lines. The NCG is by far the largest, being 6 times larger than the LLVM backend and 4 times larger than the C backend; it totals 20,570 lines.

In addition to plain code size, we also need to consider the structural and conceptual complexity, particularly for the C backend, which doesn't seem an unreasonable size. The C backend consists of three distinct components: (1) the actual compiler that maps Cmm to C code; (2) the C header files included in the generated C code; and (3) the evil mangler, a Perl script post-processing the generated assembly. The C headers define a large number of macros and data structures that decrease the work required by the code generator and also deal with platform-specific issues, such as word size. The C headers are fairly sophisticated, but arguably the most complex part of the C backend is the evil mangler, which implements the TNTC optimisation as well as a variety of other optimisations, such as removing each C function's prologue and epilogue. The fragile Perl code uses a large number of regular expressions that need to be updated regularly for new versions of GCC.

The NCG consists of a shared component plus a platform-specific component for each supported architecture. The design is fairly typical of an NCG. The shared component consists of a general framework for abstracting and driving the pipeline, as well as a register allocator. Each platform-specific component is responsible for the rest of the pipeline, principally instruction selection and pretty printing of the assembler code.

The LLVM backend comprises two components: (1) the compiler itself and (2) a code module for interfacing with LLVM. It has none of the complexity of the C backend, with its sophisticated assembly post-processing, or of the NCG, with its size and architecture-specific code. It is also nearly platform independent: it needs to know only the word size, endianness and the mapping of STG registers to hardware registers, and all of this information is already used elsewhere in GHC, so it isn't specific to the LLVM backend.
116
NCG and C backend against LLVM backend (∆%)
Lines of code of the GHC backends C
NCG
LLVM
Total Compiler Includes Assembly Processor Total Shared X86 SPARC PowerPC Total Compiler LLVM Module
5382 1122 2201 2059 20570 7777 5208 4243 3342 3133 1865 1268
Program atom comp lab zift cryptarithm1 hidden integer integrate simple transform treejoin wave4main wheel-sieve2 (79 more) -1 s.d. +1 s.d. Average
Table 2. GHC backend code sizes sists of three distinct components: (1) the actual compiler that maps Cmm to C code; (2) the C header files included in the generated C code; and (3) the evil mangler, a Perl script post-processing the generated assembly. The C headers define a large number of macros and data structures that decrease the work required by the code generator and also deal with platform specific issues, such as word size. The C headers are fairly sophisticated, but the arguably most complex part of the C backend is the evil mangler, which is implements the TNTC optimisation as well as a variety of other optimisations, such as removing each C function prologues and epilogues. The fragile Perl code uses a large number of regular expressions that need to be updated regularly for new versions of GCC. The NCG consists of a shared component plus a platformspecific component for each supported architecture. The design is fairly typical of a NCG. The shared component consists a general framework for abstracting and driving the pipeline as well as a register allocator. Each platform specific component is responsible for the rest of the pipeline, principally consisting of instruction selection and pretty printing of the assembler code. The LLVM backend comprises two components: (1) the compiler itself and (2) a code module for interfacing with LLVM. It has none of the complexity of the C backend with its sophisticated assembly post-processing or of the NCG with its size and architecture-specific code. It’s also nearly platform independent; it needs to know the word size, endianness and the mapping of STG registers to hardware registers. All of this information is already used elsewhere in GHC and so isn’t specific to the LLVM backend. 6.2
NCG Runtime -6.2 -4.6 -3.4 4.7 -1.0 2.8 1.5 4.0 -2.8 6.8 -3.4 .. -3.1 3.0 -0.1
C Runtime -0.8 -0.9 0.9 10.9 -1.3 8.3 16.6 5.7 4.6 12.4 -2.8 .. -2.0 7.6 2.7
Table 3. NoFib runtimes of all three backends Main_runExperiment_entry() c1GU: Hp = Hp + 36; if (Hp > HpLim) goto c1GX; I32[Hp - 32] = s1Gh_info; I32[Hp - 24] = I32[Sp + 12]; I32[Hp - 20] = I32[Sp + 0]; I32[Hp - 16] = I32[Sp + 4]; I32[Hp - 12] = I32[Sp + 8]; I32[Hp - 8] = ghczmprim_GHCziTypes_ZC_con_info; I32[Hp - 4] = I32[Sp + 12]; I32[Hp + 0] = Hp - 32; R1 = Hp - 6; Sp = Sp + 16; jump (I32[Sp + 0]) (); c1GY: R1 = Main_runExperiment_closure; jump stg_gc_fun (); c1GX: HpAlloc = 36; goto c1GY; Figure 9. A typical Cmm function produced by GHC
Performance
To compare the quality of the generated code, we consider runtimes, but also other metrics, such as compilation times and the size of the compiled code. We used a Core 2 Duo 2.2GHz machine running a 32-bit Linux OS and set the runtimes of the LLVM backend as the baseline, except where otherwise specified. So positive percentages for either the C backend or the NCG mean that they are slower than the LLVM backend by that percentage, and negative percentages mean they are faster by that percentage. First, we investigate the NoFib benchmark suite and then some interesting individual examples. Also, as the NCG is GHC's default backend and as the GHC developers are looking at deprecating the C backend, we focus on comparing against the NCG.

NoFib [22] is the standard benchmark suite for GHC. It is developed alongside GHC and used by developers to test the performance impact of changes to GHC. In Table 3, we see the runtimes of the NCG and C backend against the LLVM backend. There is little difference between the three backends. The NCG comes out with the best overall runtime, ahead of the LLVM backend by 0.1%. The C backend comes in last, trailing 2.7% behind the LLVM backend. The table includes only individual benchmarks whose runtimes vary significantly between backends. We investigated each of these benchmarks individually to determine the cause of the difference. We didn't find any cases where the C backend had a conceptual advantage. Where the C backend performed poorly, this was a combination of the at times awkward mapping of Haskell to C and performance bugs in GCC. Comparing the NCG against the LLVM backend, further testing showed that the performance difference was greatest for atom, hidden and wave4main. However, in all three cases, no particular feature of the code generation seemed to be responsible; it was just a matter of better default instruction selection.

All this raises an important question: why do we get such similar results with very different code generators? Especially if we consider that GHC's NCG optimisation pass consists of just branch-chain elimination and constant folding, whereas LLVM implements around 60 different optimisation passes. We conjecture that the reason for the similar performance is the Cmm code they all use as input: it just isn't easily optimised. Much of the Cmm code that GHC produces is essentially memory bound, a side effect of Haskell being a lazily evaluated language, and so there is often very little register pressure or choice in instruction selection, which is why the NCG is able to perform close to LLVM. Figure 9 contains some typical Cmm code, illustrating the problem. To further test our conjecture, we ran the NoFib benchmarks with the optimisation level of LLVM set to the various supported default groups: -O0, -O1, -O2 and -O3. The results are in Table 4.

NoFib, however, doesn't tell us the full story concerning performance, specifically due to the idiomatic Cmm of GHC. This becomes apparent, for example, in code using stream fusion [8] and highly optimised array code, such as that of the parallel array library Repa [17]. The Cmm code of the compute-intensive, tight inner loops of these libraries generally suffers from high register pressure and can benefit from smart instruction ordering, which leaves considerable scope for LLVM to optimise performance. The considerable impact that LLVM's optimisations can have on such code is quantified in Table 5, where we compare the single-threaded and multi-threaded performance of NCG- and LLVM-generated code for three Repa benchmarks (see [17] for details on these benchmarks).

We further investigated two simple benchmarks featuring tight loops: (1) zip3, Figure 10, uses the high-performance vector library, based on array fusion; and (2) hailstone, Figure 11, relies on list fusion from the standard Prelude and unboxed integers. We evaluated both benchmarks using the Criterion benchmarking library [21], looking at the resulting kernel density estimates. Figure 12 shows the kernel density estimates for both benchmarks using the three backends. For zip3, the LLVM backend comes out clearly in front with a mean runtime of 334ms, the C backend second with 423ms and the NCG last with a mean of 590ms. The generated Cmm code consists of 3 functions that produce the three enumerated lists. Each calls a common comparator function that checks whether the end of the list has been reached. The LLVM backend aggressively inlines the comparator function, saving a jump instruction for each of the three list enumerations. The C backend generates remarkably similar code to the NCG; the difference simply seems to be in the ordering of some branches and basic blocks, with the C backend choosing the correct hot path. For hailstone, we see that the C backend comes out in front with a mean of 567ms, the LLVM backend second with a mean of 637ms and the NCG last with a disappointing mean of 2.268s. The LLVM and C backends perform well for two reasons: (1) they both perform significantly better instruction selection and (2) they both inline a large amount of code. The C backend outperforms the LLVM backend due to slightly better branch ordering.

Table 6 lists the summary of the compile times and object file sizes for the NoFib suite. Both the NCG and C backend produce smaller code. This is currently a deficiency of LLVM: it does not yet optimise for code size. For compile times, the LLVM backend sits between the NCG and C backend. This is due to LLVM's additional optimisation passes, which incur an overhead compared to the NCG. We saw the considerable benefit of these optimisations in the runtimes of the Repa, zip3, and hailstone benchmarks.
         -1 s.d.   +1 s.d.   Average
  -O1      -4.9      3.0       -1.0
  -O2      -6.7      2.1       -2.4
  -O3      -6.3      4.0       -1.3

Table 4. NoFib runtimes of LLVM at different optimisation levels: LLVM optimiser against -O0 (∆%)

  Metric               NCG      C Backend
  Object File Sizes   -12.8        -5.2
  Compilation Times   -64.8       +35.6

Table 6. NoFib: Object file sizes and compile times
  Benchmark   #cores    NCG      LLVM     Speed up
  mmult       1         13.38s   4.64s    2.88
  mmult       8         1.68s    0.62s    2.71
  laplace     1         4.75s    2.98s    1.59
  laplace     8         1.44s    1.15s    1.25
  fft         1         8.88s    8.75s    1.01
  fft         7         2.06s    2.02s    1.02

Table 5. LLVM versus NCG performance for Repa benchmarks

import qualified Data.Vector.Unboxed as U

main = print $ U.sum $ U.zipWith3 (\x y z -> x * y * z)
    (U.enumFromTo 1 (100000000::Int))
    (U.enumFromTo 2 (100000001::Int))
    (U.enumFromTo 7 (100000008::Int))

Figure 10. Vector Zip3 benchmark

collatzLen :: Int -> Word32 -> Int
collatzLen c 1 = c
collatzLen c n = collatzLen (c+1) $ if n `mod` 2 == 0 then n `div` 2 else 3*n+1

pmax x n = x `max` (collatzLen 1 n, n)

main = print $ foldl pmax (1,1) [2..1000000]

Figure 11. Hailstone benchmark
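As an aside on methodology, the following is a sketch of ours (not the authors' harness) showing how the hailstone kernel could be timed with the Criterion library [21]; it reuses pmax from Figure 11.

import Criterion.Main

-- Time the hailstone fold from Figure 11; Criterion runs it repeatedly
-- and reports a kernel density estimate of the execution times.
main :: IO ()
main = defaultMain
  [ bench "hailstone" $ whnf (foldl pmax (1, 1)) [2..1000000] ]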
6.3  LLVM's type system
An advantage of LLVM is its fairly strong type system and static checking of compiled code. While it doesn't approach the level of sophistication that Haskell programmers are used to, it does offer a system similar to C's. All variables and memory locations are typed, all operations obey strict type rules, and pointers are carefully distinguished from other types. For example, to perform pointer arithmetic, a pointer must first be cast to an integer of word width, and the result cast back to a pointer afterwards. We found this type system very helpful while implementing the LLVM backend, especially as there is usually little compiler support for such a low-level task as code generation. To quantify the benefit of LLVM's checks, we scanned the source code revision history for the backend, looking at the bugs we still had to fix after the backend was able to compile a whole Haskell program. There were 15 fixes in total, after which the backend was capable of compiling GHC itself. Of these 15, 10 fixes were motivated by compile-time errors generated by LLVM. Some of these were obvious bugs that would also have been discovered by a traditional assembler, but a few were more subtle. They generally related to pointer handling, such as one bug where we returned the pointer itself instead of the value it pointed to. Of the 5 bugs that LLVM didn't pick up, two were related to generating incorrect function and data labels, one was an incorrectly compiled negate operation, one an incorrectly compiled label offset operation, and one was due to a bug in the LLVM optimiser.
[Figure 12 (plots omitted): kernel density estimates of execution times, one panel per benchmark and backend: "zipWith3 (llvm)", "zipWith3 (asm)", "zipWith3 (c)", "hailstone (llvm)", "hailstone (asm)" and "hailstone (c)". Each panel plots execution time (x-axis) against an estimate of probability density (y-axis).]
Figure 12. Runtimes of Hailstone and Zip3 benchmarks
The combination of LLVM's type system with the SSA form, which requires every operation to be explicit, is a significant help for compiler development. It is also rewarding to see type systems being used at such a low level: although the type system was originally designed to enable more aggressive optimisations, it also increases safety.
7.  Related work
There is a large amount of work in the compiler community around LLVM, including static code generation, just-in-time code generators, and static analysis tools. Van Schie recently added an LLVM backend to the Utrecht Haskell Compiler (UHC, previously known as the Essential Haskell Compiler, EHC), a research Haskell compiler designed around the idea of implementing the compiler as a series of compilers [11, 29]. UHC's usual backend is a C code generator. His work produced some impressive results, which included a 10% reduction in the runtime of compiled programs. The UHC LLVM backend, however, didn't reach the stage of being able to handle the full Haskell language, instead working with a subset of Haskell. The results with the UHC backend can hardly be compared to our work, as GHC by default already generates code that is on average 40 times faster than that of UHC. Other projects using LLVM include the following:
• Clang: A C, C++ and Objective-C compiler using LLVM as a backend target [1].
• LDC: A compiler for the D programming language using LLVM as a backend target [2].
• llvm-lua: A compiler for the Lua programming language using LLVM as a backend target [3].
• SAFECode: A compiler that takes LLVM bitcode as input and uses static analysis to produce a memory-safe version of the same program [10].
• OpenJDK Project Zero: A version of Sun Microsystems' open-source JVM, OpenJDK, which uses zero assembly. LLVM is used as a replacement for the usual just-in-time compiler [4].
• Pure: A functional programming language based on term rewriting. Pure uses LLVM as a just-in-time compiler [13].
• Unladen Swallow: A Google-backed Python virtual machine with an LLVM-based just-in-time compiler [5].

Using LLVM isn't the only approach, though, to providing a relatively portable, easy-to-target, high-performance compiler target; there are also high-level virtual machines, such as Microsoft's Common Language Runtime (CLR) [16] or the Java Virtual Machine (JVM) [20]. These are in some ways quite similar to LLVM: they all provide a virtual instruction set that abstracts away the underlying hardware and can be targeted by multiple programming languages. The functional programming language Clojure [15], for example, targets the JVM with great success. Targeting the JVM or CLR also has the added benefit of direct access to the high-quality libraries that come with both platforms. There are trade-offs, though: using a high-level virtual machine means that many of your choices are made for you. Features such as garbage collection and exception handling are provided, and as such you need to be able to efficiently map your language onto the design of these services, which may not always be possible, or at least efficient. Neither the CLR nor the JVM provides an option of compiling your code to native machine code; both use interpreters and just-in-time compilation for execution, which generally leads to lower performance. LLVM doesn't include these high-level services and enables us to use infrastructure optimised for Haskell. It also permits us to choose between static compilation or interpretation with just-in-time compilation.

As well as the work around LLVM, there is also work being done in GHC around code generation. Ramsey et al. are redesigning the architecture of GHC's backend pipeline [25] to improve the code generated by GHC. A large part of this work is the design of a dataflow optimisation framework called Hoopl that can be used to easily compose together distinct passes. While there is some
overlap in the optimisations that can be done with Hoopl and those implemented by LLVM, this work is mostly complementary to the LLVM backend as it is intended to replace the current STG to Cmm code generator with a much more modular design, not just duplicate optimisations present in LLVM. The end result of the work will be more efficient Cmm code passed to the LLVM code generator.
8.  Conclusion
Our LLVM backend is clearly simpler, conceptually and in terms of lines of code, than the two previous backends. It effectively outsources a sophisticated part of GHC's compilation pipeline and frees developer resources to concentrate on issues that are more directly relevant to the Haskell community. Our quantitative analysis shows that the LLVM backend, already in its current form, generates code that is on par with GHC's native code generator (the more efficient of the two current backends). For tight loops, as generated by the vector package, we even see a clear performance advantage for our backend. The biggest disadvantage of the LLVM backend is currently its comparatively high compilation time with respect to the native code generator. This is partly to be expected, as the LLVM backend performs a lot more optimisation work on the code. We do expect to be able to improve compilation speed, though: currently the LLVM backend interfaces with LLVM via intermediate files and the LLVM command-line tools, which wastes time parsing and pretty printing LLVM code. LLVM can also be used as a shared library, using entirely in-memory representations of the LLVM IR. By using this facility, we should be able to improve compilation speeds significantly.
Acknowledgements. We are grateful to Chris Lattner, Eugene Todler, Ben Lippmeier and Simon Marlow for technical advice during the development of the LLVM backend. We furthermore thank Ben Lippmeier for providing the Repa benchmark results. We thank Gabriele Keller for improving a draft of this paper. The first author thanks Microsoft Research, Cambridge, for hosting him as an intern while writing this paper and further improving GHC's LLVM backend.
References

[1] clang: A C language family frontend for LLVM. http://clang.llvm.org/, 2010.
[2] LDC: LLVM D Compiler. http://www.dsource.org/projects/ldc, 2010.
[3] llvm-lua: JIT/static compiler for Lua using LLVM on the backend. http://code.google.com/p/llvm-lua/, 2010.
[4] OpenJDK: Zero-Assembler project. http://openjdk.java.net/projects/zero/, 2010.
[5] Unladen Swallow: A faster implementation of Python. http://code.google.com/p/unladen-swallow/, 2010.
[6] A. W. Appel. SSA is functional programming. ACM SIGPLAN Notices, 33(4):17–20, 1998.
[7] P. Codognet and D. Diaz. wamcc: compiling Prolog to C. In Proceedings of the Twelfth International Conference on Logic Programming, pages 317–331, 1995.
[8] D. Coutts, R. Leshchinskiy, and D. Stewart. Stream fusion: From lists to streams to nothing at all. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, ICFP 2007, Apr. 2007.
[9] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):451–490, October 1991.
[10] D. Dhurjati, S. Kowshik, and V. Adve. SAFECode: enforcing alias analysis for weakly typed languages. In PLDI '06: Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 144–157, New York, NY, USA, 2006. ACM. ISBN 1-59593-320-4.
[11] A. Dijkstra, J. Fokker, and S. D. Swierstra. The structure of the Essential Haskell Compiler, or coping with compiler complexity. In IFL '07: Proceedings of the 19th International Symposium on Implementation and Application of Functional Languages, pages 107–122, 2007.
[12] C. Flanagan, A. Sabry, B. F. Duba, and M. Felleisen. The essence of compiling with continuations. In Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation, PLDI '93, volume 28(6), pages 237–247. ACM Press, 1993.
[13] A. Gräf. The Pure programming language. http://code.google.com/p/pure-lang/, 2009.
[14] F. Henderson, Z. Somogyi, and T. Conway. Compiling logic programs to C using GNU C as a portable assembler. In Proceedings of the ILPS '95 Postconference Workshop on Sequential Implementation Technologies for Logic Programming Languages, 1995.
[15] R. Hickey. The Clojure programming language. In DLS '08: Proceedings of the 2008 Symposium on Dynamic Languages, pages 1–1, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-270-2.
[16] ECMA International. Standard ECMA-335: Common Language Infrastructure (CLI), 4th edition. Technical report, ECMA International, June 2006.
[17] G. Keller, M. M. T. Chakravarty, R. Leshchinskiy, S. Peyton Jones, and B. Lippmeier. Regular, shape-polymorphic, parallel arrays in Haskell. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, ICFP 2010, Sept. 2010.
[18] C. Lattner. LLVM: An infrastructure for multi-stage optimization. Master's thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, December 2002.
[19] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO '04: Proceedings of the International Symposium on Code Generation and Optimization, page 75. IEEE Computer Society, 2004. ISBN 0-7695-2102-9.
[20] T. Lindholm and F. Yellin. Java Virtual Machine Specification. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, April 1999. ISBN 0201432943.
[21] B. O'Sullivan. criterion: Robust, reliable performance measurement and analysis. http://hackage.haskell.org/package/criterion, 2010.
[22] W. Partain. The NoFib benchmark suite of Haskell programs. Pages 195–202, London, UK, 1993. Springer-Verlag.
[23] S. L. Peyton Jones. Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine. Journal of Functional Programming, 2(2), 1992.
[24] S. L. Peyton Jones, N. Ramsey, and F. Reig. C--: A portable assembly language that supports garbage collection. In PPDP '99: Proceedings of the International Conference on Principles and Practice of Declarative Programming, pages 1–28. Springer-Verlag, 1999. ISBN 3-540-66540-4.
[25] N. Ramsey, J. Dias, and S. Peyton Jones. Hoopl: Dataflow optimization made simple. In ACM SIGPLAN Haskell Symposium 2010. ACM Press, 2010.
[26] M. Sulzmann, M. Chakravarty, S. Peyton Jones, and K. Donnelly. System F with type equality coercions. In ACM SIGPLAN International Workshop on Types in Language Design and Implementation (TLDI '07). ACM, 2007.
[27] D. Tarditi, P. Lee, and A. Acharya. No assembly required: compiling Standard ML to C. ACM Letters on Programming Languages and Systems, 1(2):161–177, 1992. ISSN 1057-4514.
[28] The LLVM Team. The LLVM compiler infrastructure: LLVM users. http://llvm.org/Users.html.
[29] J. van Schie. Compiling Haskell to LLVM. Master's thesis, Department of Information and Computing Sciences, Utrecht University, 2008.
Hoopl: A Modular, Reusable Library for Dataflow Analysis and Transformation

Norman Ramsey         Tufts University       [email protected]
João Dias             Tufts University       [email protected]
Simon Peyton Jones    Microsoft Research     [email protected]
Abstract

Dataflow analysis and transformation of control-flow graphs is pervasive in optimizing compilers, but it is typically entangled with the details of a particular compiler. We describe Hoopl, a reusable library that makes it unusually easy to define new analyses and transformations for any compiler written in Haskell. Hoopl's interface is modular and polymorphic, and it offers unusually strong static guarantees. The implementation encapsulates state-of-the-art algorithms (interleaved analysis and rewriting, dynamic error isolation), and it cleanly separates their tricky elements so that they can be understood independently.

Readers: Code examples are indexed at http://bit.ly/cZ7ts1.

Categories and Subject Descriptors D.3.4 [Processors]: Optimization, Compilers; D.3.2 [Language Classifications]: Applicative (functional) languages, Haskell

General Terms Algorithms, Design, Languages

1.  Introduction

A mature optimizing compiler for an imperative language includes many analyses, the results of which justify the optimizer's code-improving transformations. Many analyses and transformations—constant propagation, live-variable analysis, inlining, sinking of loads, and so on—should be regarded as particular cases of a single general problem: dataflow analysis and optimization. Dataflow analysis is over thirty years old, but a recent, seminal paper by Lerner, Grove, and Chambers (2002) goes further, describing a powerful but subtle way to interleave analysis and transformation so that each piggybacks on the other.

Because optimizations based on dataflow analysis share a common intellectual framework, and because that framework is subtle, it is tempting to try to build a single, reusable library that embodies the subtle ideas, while making it easy for clients to instantiate the library for different situations. Although such libraries exist, as we discuss in Section 6, they have complex APIs and implementations, and none interleaves analysis with transformation.

In this paper we present Hoopl (short for "higher-order optimization library"), a new Haskell library for dataflow analysis and optimization. It has the following distinctive characteristics:

• Hoopl is purely functional. Although pure functional languages are not obviously suited to writing standard algorithms that transform control-flow graphs, pure functional code is actually easier to write, and far easier to write correctly, than code that is mostly functional but uses a mutable representation of graphs (Ramsey and Dias 2005). When analysis and transformation are interleaved, so that graphs must be transformed speculatively, without knowing whether a transformed graph will be retained or discarded, pure functional code offers even more benefits.

• Hoopl is polymorphic. Just as a list library is polymorphic in the list elements, so is Hoopl polymorphic, both in the nodes that inhabit graphs and in the dataflow facts that analyses compute over these graphs (Section 4).

• The paper by Lerner, Grove, and Chambers is inspiring but abstract. We articulate their ideas in a concrete, simple API, which hides a subtle implementation (Sections 3 and 4). You provide a representation for facts, a transfer function that transforms facts across nodes, and a rewrite function that can use a fact to justify rewriting a node. Hoopl "lifts" these node-level functions to work over control-flow graphs, solves recursion equations, and interleaves rewriting with analysis. Designing APIs is surprisingly hard; after a dozen significantly different iterations, we offer our API as a contribution.

• Because clients can perform very local reasoning ("y is live before x:=y+2"), analyses and transformations built on Hoopl are small, simple, and easy to get right. Moreover, Hoopl helps you write correct optimizations: statically, it rules out transformations that violate invariants of the control-flow graph (Sections 3 and 4.3), and dynamically, it can help find the first transformation that introduces a fault in a test program (Section 5.5).

• Hoopl implements subtle algorithms, including (a) interleaved analysis and rewriting, (b) speculative rewriting, (c) computing fixed points, and (d) dynamic fault isolation. Previous implementations of these algorithms—including three of our own—are complicated and hard to understand, because the tricky pieces are implemented all together, inseparably. In this paper, each tricky piece is handled in just one place, separate from the others (Section 5). We emphasize this implementation as an object of interest in its own right.

Our work bridges the gap between abstract, theoretical presentations and actual compilers. Hoopl is available from http://ghc.cs.tufts.edu/hoopl and also from Hackage (version 3.8.6.0). One of Hoopl's clients is the Glasgow Haskell Compiler, which uses Hoopl to optimize imperative code in GHC's back end.
Hoopl’s API is made possible by sophisticated aspects of Haskell’s type system, such as higher-rank polymorphism, GADTs, and type functions. Hoopl may therefore also serve as a case study in the utility of these features.
2.  Dataflow analysis & transformation by example

A control-flow graph, perhaps representing the body of a procedure, is a collection of basic blocks—or just "blocks." Each block is a sequence of instructions, beginning with a label and ending with a control-transfer instruction that branches to other blocks. The goal of dataflow optimization is to compute valid dataflow facts, then use those facts to justify code-improving transformations (or rewrites) on a control-flow graph.

As a concrete example, we show constant propagation with constant folding. On the left we show a basic block; in the middle we show facts that hold between statements (or nodes) in the block; and on the right we show the result of transforming the block based on the facts:

Before                     Facts                After
                  ------------{}------------
x := 3+4                                       x := 7
                  ----------{x=7}-----------
z := x>5                                       z := True
                  -------{x=7, z=True}------
if z then goto L1                              goto L1
     else goto L2

Constant propagation works from top to bottom. In this example, we start with the empty fact. Given that fact and the node x:=3+4, can we make a transformation? Yes: constant folding can replace the node with x:=7. Now, given this transformed node and the original fact, what fact flows out of the bottom of the transformed node? The fact {x=7}. Given the fact {x=7} and the node z:=x>5, can we make a transformation? Yes: constant propagation can replace the node with z:=7>5. Now, can we make another transformation? Yes: constant folding can replace the node with z:=True. The process continues to the end of the block, where we can replace the conditional branch with an unconditional one, goto L1.

The example above is simple because it has only straight-line code; control flow makes dataflow analysis more complicated. For example, consider a graph with a conditional statement, starting at L1:

L1: x=3; y=4; if z then goto L2 else goto L3
L2: x=7; goto L3
L3: ...

Because control flows to L3 from two places (L1 and L2), we must join the facts coming from those two places. All paths to L3 produce the fact y=4, so we can conclude that y=4 at L3. But depending on the path to L3, x may have different values, so we conclude "x=⊤", meaning that there is no single value held by x at L3. The final result of joining the dataflow facts that flow to L3 is the fact x=⊤ ∧ y=4 ∧ z=⊤.

Forward and backward. Constant propagation works forward, and a fact is often an assertion about the program state, such as "variable x holds value 7." Some useful analyses work backward. A prime example is live-variable analysis, where a fact takes the form "variable x is live" and is an assertion about the continuation of a program point. For example, the fact "x is live" at a program point P is an assertion that x is used on some program path starting at P. The accompanying transformation is called dead-code elimination; if x is not live, this transformation replaces the node x:=e with a no-op.

Interleaved analysis and transformation. Our first example interleaves analysis and transformation. Interleaving makes it easy to write effective analyses. If instead we had to finish analyzing the block before transforming it, analyses would have to "predict" the results of transformations. For example, given the incoming fact {x=7} and the instruction z:=x>5, a pure analysis could produce the outgoing fact {x=7, z=True} by simplifying x>5 to True. But the subsequent transformation must perform exactly the same simplification when it transforms the instruction to z:=True! If instead we first rewrite the node to z:=True, then apply the transfer function to the new node, the transfer function becomes wonderfully simple: it merely has to see if the right hand side is a constant. You can see code in Section 4.6.

Another example is the interleaving of liveness analysis and dead-code elimination. As mentioned in Section 1, it is sufficient for the analysis to say "y is live before x:=y+2". It is not necessary to have the more complex rule "if x is live after x:=y+2 then y is live before it," because if x is not live after x:=y+2, the assignment x:=y+2 will be transformed away (eliminated). When several analyses and transformations can interact, interleaving them offers even more compelling benefits; for more substantial examples, consult Lerner, Grove, and Chambers (2002).

But the benefits come at a cost. To compute valid facts for a program that has loops, an analysis may require multiple iterations. Before the final iteration, the analysis may compute a fact that is invalid, and a transformation may use the invalid fact to rewrite the program (Section 4.7). To avoid unjustified rewrites, any rewrite based on an invalid fact must be rolled back; transformations must be speculative. As described in Section 4.7, Hoopl manages speculation with minimal cooperation from the client.

While it is wonderful that we can create complex optimizations by interleaving very simple analyses and transformations, it is not so wonderful that very simple analyses and transformations, when interleaved, can exhibit complex emergent behavior. Because such behavior is not easily predicted, it is essential to have good tools for debugging. Hoopl's primary debugging tool is an implementation of Whalley's (1994) search technique for finding fault-inducing transformations (Section 5.5).
3.  Representing control-flow graphs

Hoopl is a library that makes it easy to define dataflow analyses—and transformations driven by these analyses—on control-flow graphs. Graphs are composed from smaller units, which we discuss from the bottom up:

• A node is defined by Hoopl's client; Hoopl knows nothing about the representation of nodes (Section 3.2).
• A basic block is a sequence of nodes (Section 3.3).
• A graph is an arbitrarily complicated control-flow graph: basic blocks connected by edges (Section 3.4).

3.1  Shapes: Open and closed

In Hoopl, nodes, blocks, and graphs share an important new property: a shape. A thing's shape tells us whether the thing is open or closed on entry and open or closed on exit. At an open point, control may implicitly "fall through;" at a closed point, control transfer must be explicit and to a named label. For example,
• A shift-left instruction is open on entry (because control can fall into it from the preceding instruction), and open on exit (because control falls through to the next instruction).
• An unconditional branch is open on entry, but closed on exit (because control cannot fall through to the next instruction).
• A label is closed on entry (because in Hoopl we do not allow control to fall through into a branch target), but open on exit.
• The shape of a function-call node is up to the client. If a call always returns to its inline successor, it could be open on entry and exit. But if a call could return in multiple ways—for example by returning normally or by raising an exception—then it has to be closed on exit. GHC uses calls of both shapes.
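To make the last bullet concrete, here is a sketch of ours (anticipating the type-level shape encoding of Section 3.2) of how a client might declare call nodes of both shapes; Fun and Arg are hypothetical client types.

data CallNode e x where
  -- A call that always returns to its inline successor: open/open.
  CallReturns  :: Fun -> [Arg] -> CallNode O O
  -- A call that either returns normally or raises an exception,
  -- each to a named label: open/closed.
  CallMayRaise :: Fun -> [Arg] -> Label -> Label -> CallNode O C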
data Node e x where
  Label  :: Label                  -> Node C O
  Assign :: Var  -> Expr           -> Node O O
  Store  :: Expr -> Expr           -> Node O O
  Branch :: Label                  -> Node O C
  Cond   :: Expr -> Label -> Label -> Node O C
  ... more constructors ...

Figure 1. A typical node type as it might be defined by a client

data O   -- Open
data C   -- Closed

data Block n e x where
  BFirst  :: n C O -> Block n C O
  BMiddle :: n O O -> Block n O O
  BLast   :: n O C -> Block n O C
  BCat    :: Block n e O -> Block n O x -> Block n e x

data Graph n e x where
  GNil  :: Graph n O O
  GUnit :: Block n O O -> Graph n O O
  GMany :: MaybeO e (Block n O C)
        -> LabelMap (Block n C C)
        -> MaybeO x (Block n C O)
        -> Graph n e x

data MaybeO ex t where
  JustO    :: t -> MaybeO O t
  NothingO ::      MaybeO C t

newtype Label      -- abstract
newtype LabelMap a -- finite map from Label to a

addBlock   :: NonLocal n
           => Block n C C
           -> LabelMap (Block n C C)
           -> LabelMap (Block n C C)
blockUnion :: LabelMap a -> LabelMap a -> LabelMap a

class NonLocal n where
  entryLabel :: n C x -> Label
  successors :: n e C -> [Label]

Figure 2. The block and graph types defined by Hoopl

Blocks and graphs have shapes too. For example the block

x:=7; y:=x+2; goto L

is open on entry and closed on exit, which we often abbreviate "open/closed." We may also refer to an "open/closed block." The shape of a thing determines that thing's control-flow properties. In particular, whenever E is a node, block, or graph,

• If E is open on entry, it has a unique predecessor; if it is closed, it may have arbitrarily many predecessors—or none.
• If E is open on exit, it has a unique successor; if it is closed, it may have arbitrarily many successors—or none.

3.2  Nodes

The primitive constituents of a control-flow graph are nodes. For example, in a back end a node might represent a machine instruction, such as a load, a call, or a conditional branch; in a higher-level intermediate form, a node might represent a simple statement. Hoopl's graph representation is polymorphic in the node type, so each client can define nodes as it likes. Because they contain nodes defined by the client, graphs can include arbitrary data specified by the client, including (say) method calls, C statements, stack maps, or whatever.
The type of a node specifies its shape at compile time. Concretely, the type constructor for a node has kind *->*->*, where the two type parameters are type-level flags, one for entry and one for exit. Each type parameter may be instantiated only with type O (for open) or type C (for closed).

As an example, Figure 1 shows a typical node type as it might be defined by one of Hoopl's clients. The type parameters are written e and x, for entry and exit respectively. The type is a generalized algebraic data type; the syntax gives the type of each constructor. For example, constructor Label takes a Label and returns a node of type Node C O, where the "C" says "closed on entry" and the "O" says "open on exit". The types Label, O, and C are defined by Hoopl (Figure 2). In other examples from Figure 1, constructor Assign takes a variable and an expression, and it returns a Node open on both entry and exit; constructor Store is similar. Finally, control-transfer nodes Branch and Cond (conditional branch) are open on entry and closed on exit. Types Var and Expr are private to the client, and Hoopl knows nothing about them.

Nodes closed on entry are the only targets of control transfers; nodes open on entry and exit never perform control transfers; and nodes closed on exit always perform control transfers.¹ Because of the position each shape of node occupies in a basic block, we often call them first, middle, and last nodes respectively.

¹ To obey these invariants, a node for a conditional-branch instruction, which typically either transfers control or falls through, must be represented as a two-target conditional branch, with the fall-through path in a separate block. This representation is standard (Appel 1998), and it costs nothing in practice: such code is easily sequentialized without superfluous branches.

3.3  Blocks

Hoopl combines the client's nodes into blocks and graphs, which, unlike the nodes, are defined by Hoopl (Figure 2). A Block is parameterized over the node type n as well as over the flag types that make it open or closed at entry and exit.

The BFirst, BMiddle, and BLast constructors create one-node blocks. Each of these constructors is polymorphic in the node's representation but monomorphic in its shape. Why not use a single constructor of type n e x -> Block n e x, which would be polymorphic in a node's representation and shape? Because by making the shape known statically, we simplify the implementation of analysis and transformation in Section 5.

The BCat constructor concatenates blocks in sequence. It makes sense to concatenate blocks only when control can fall through from the first to the second; therefore, two blocks may be concatenated only if each block is open at the point of concatenation. This restriction is enforced by the type of BCat, whose first argument must be open on exit and whose second argument must be open on entry. It is impossible, for example, to concatenate a Branch immediately before an Assign. Indeed, the Block type guarantees statically that any closed/closed Block—which compiler writers normally call a "basic block"—consists of exactly one first node (such as Label in Figure 1), followed by zero or more middle nodes (Assign or Store), and terminated with exactly one last node (Branch or Cond). Enforcing these invariants by using GADTs is one of Hoopl's innovations.
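To see the shape discipline at work, here is a small sketch of ours that assembles the basic block "entry: v:=e; goto target" from one-node blocks; Var and Expr are the client types of Figure 1, so the variable and expression are taken as parameters.

-- Build a closed/closed basic block: a label, one assignment, and an
-- unconditional branch. Each BCat joins an open exit to an open entry.
mkAssignBlock :: Label -> Var -> Expr -> Label -> Block Node C C
mkAssignBlock entry v e target =
  (BFirst (Label entry) `BCat` BMiddle (Assign v e))
    `BCat` BLast (Branch target)

Appending anything after the BLast, or starting the block with anything but a first node, is rejected by the type checker, which is exactly the invariant described above.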
3.4  Graphs

Part of optimizer                 Specified by   Implemented by   How many
Control-flow graphs               US             US               One
Nodes in a control-flow graph     YOU            YOU              One type per intermediate language
Dataflow fact F                   YOU            YOU              One type per logic
Lattice operations                US             YOU              One set per logic
Transfer functions                US             YOU              One per analysis
Rewrite functions                 US             YOU              One per transformation
Analyze-and-rewrite functions     US             US               Two (forward, backward)

Table 3. Parts of an optimizer built with Hoopl

Hoopl composes blocks into graphs, which are also defined in Figure 2. Like Block, the data type Graph is parameterized over both nodes n and over its shape at entry and exit (e and x). Graph has three constructors. The first two deal with the base cases of open/open graphs: an empty graph is represented by GNil while a single-block graph is represented by GUnit.

More general graphs are represented by GMany, which has three fields: an optional entry sequence, a body, and an optional exit sequence.

• If the graph is open on entry, it contains an entry sequence of
type Block n O C. We could represent this sequence as a value of type Maybe (Block n O C), but we can do better: a value of Maybe type requires a dynamic test, but we know statically, at compile time, that the sequence is present if and only if the graph is open on entry. We express our compile-time knowledge by using the type MaybeO e (Block n O C), a type-indexed version of Maybe which is also defined in Figure 2: the type MaybeO O a is isomorphic to a, while the type MaybeO C a is isomorphic to ().
• The body of the graph is a collection of closed/closed blocks. To facilitate traversal of the graph, we represent the body as a finite map from label to block.
• The exit sequence is dual to the entry sequence, and like the entry sequence, its presence or absence is deducible from the static type of the graph.

Graphs can be spliced together nicely; the cost is logarithmic in the number of closed/closed blocks. Unlike blocks, two graphs may be spliced together not only when they are both open at the splice point but also when they are both closed—and not in the other two cases:

gSplice :: Graph n e a -> Graph n a x -> Graph n e x
gSplice GNil g2 = g2
gSplice g1 GNil = g1

gSplice (GUnit b1) (GUnit b2) = GUnit (b1 `BCat` b2)

gSplice (GUnit b) (GMany (JustO e) bs x)
  = GMany (JustO (b `BCat` e)) bs x

gSplice (GMany e bs (JustO x)) (GUnit b2)
  = GMany e bs (JustO (x `BCat` b2))

gSplice (GMany e1 bs1 (JustO x1)) (GMany (JustO e2) bs2 x2)
  = GMany e1 (bs1 `blockUnion` (b `addBlock` bs2)) x2
  where b = x1 `BCat` e2

gSplice (GMany e1 bs1 NothingO) (GMany NothingO bs2 x2)
  = GMany e1 (bs1 `blockUnion` bs2) x2

This definition illustrates the power of GADTs: the pattern matching is exhaustive, and all the shape invariants are checked statically. For example, consider the second-to-last equation for gSplice. Since the exit sequence of the first argument is JustO x1, we know that type parameter a is O, and hence the entry sequence of the second argument must be JustO e2. Moreover, block x1 must be closed/open, and block e2 must be open/closed. We can therefore concatenate x1 and e2 with BCat to produce a closed/closed block b, which is added to the body of the result.

We have carefully crafted the types so that if BCat is considered as an associative operator, every graph has a unique representation. To guarantee uniqueness, GUnit is restricted to open/open blocks. If GUnit were more polymorphic, there would be more than one way to represent some graphs, and it wouldn't be obvious to a client which representation to choose—or if the choice made a difference.

3.5  Edges, labels and successors

Although Hoopl is polymorphic in the type of nodes, it still needs to know how control may be transferred from one node to another. Within a block, a control-flow edge is implicit in every application of the BCat constructor. An implicit edge originates in a first node or a middle node and flows to a middle node or a last node.

Between blocks, a control-flow edge is represented as chosen by the client. An explicit edge originates in a last node and flows to a (labelled) first node. If Hoopl is polymorphic in the node type, how can it follow such edges? Hoopl requires the client to make the node type an instance of Hoopl's NonLocal type class, which is defined in Figure 2. The entryLabel method takes a first node (one closed on entry, as per Section 3.2) and returns its Label; the successors method takes a last node (closed on exit) and returns the Labels to which it can transfer control.

In Figure 1, the client's instance declaration for Node would be

instance NonLocal Node where
  entryLabel (Label l)      = l
  successors (Branch b)     = [b]
  successors (Cond e b1 b2) = [b1, b2]

Again, the pattern matching for both functions is exhaustive, and the compiler checks this fact statically. Here, entryLabel cannot be applied to an Assign or Branch node, and any attempt to define a case for Assign or Branch would result in a type error.

While the client provides this information about nodes, it is convenient for Hoopl to get the same information about blocks. Internally, Hoopl uses this instance declaration for the Block type:

instance NonLocal n => NonLocal (Block n) where
  entryLabel (BFirst n) = entryLabel n
  entryLabel (BCat b _) = entryLabel b
  successors (BLast n)  = successors n
  successors (BCat _ b) = successors b

Because the functions entryLabel and successors are used to track control flow within a graph, Hoopl does not need to ask for the entry label or successors of a Graph itself. Indeed, Graph cannot be an instance of NonLocal, because even if a Graph is closed on entry, it need not have a unique entry label.
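Because entryLabel and successors are all that Hoopl needs, a client can also write node-agnostic utilities of its own. A small sketch (ours, not from the paper):

-- Collect the entry labels and all control-flow targets of a list of
-- basic blocks, for any node type with a NonLocal instance.
entryLabels :: NonLocal n => [Block n C C] -> [Label]
entryLabels = map entryLabel

targetsOf :: NonLocal n => [Block n C C] -> [Label]
targetsOf = concatMap successors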
4.  Using Hoopl to analyze and transform graphs

Now that we have graphs, how do we optimize them? Hoopl makes it easy; a client must supply these pieces:
• A node type (Section 3.2). Hoopl supplies the Block and Graph types that let the client build control-flow graphs out of nodes.
• A data type of facts and some operations over those facts (Section 4.1). Each analysis uses facts that are specific to that particular analysis, which Hoopl accommodates by being polymorphic in the fact type.
• A transfer function that takes a node and returns a fact transformer, which takes a fact flowing into the node and returns the transformed fact that flows out of the node (Section 4.2).
• A rewrite function that takes a node and an input fact, performs a monadic action, and returns either Nothing or Just g, where g is a graph that should replace the node (Sections 4.3 and 4.4). For many code-improving transformations, the ability to replace a node by a graph is crucial.

These requirements are summarized in Table 3. Because facts, transfer functions, and rewrite functions work together, we combine them in a single record of type FwdPass (Figure 4).

data FwdPass m n f
  = FwdPass { fp_lattice  :: DataflowLattice f
            , fp_transfer :: FwdTransfer n f
            , fp_rewrite  :: FwdRewrite m n f }

------- Lattice ----------
data DataflowLattice f = DataflowLattice
  { fact_bot  :: f
  , fact_join :: JoinFun f }
type JoinFun f = OldFact f -> NewFact f -> (ChangeFlag, f)
newtype OldFact f = OldFact f
newtype NewFact f = NewFact f
data ChangeFlag = NoChange | SomeChange

------- Transfers ----------
newtype FwdTransfer n f  -- abstract type
mkFTransfer :: (forall e x . n e x -> f -> Fact x f)
            -> FwdTransfer n f

------- Rewrites ----------
newtype FwdRewrite m n f  -- abstract type
mkFRewrite :: FuelMonad m
           => (forall e x . n e x -> f -> m (Maybe (Graph n e x)))
           -> FwdRewrite m n f
thenFwdRw  :: FwdRewrite m n f -> FwdRewrite m n f
           -> FwdRewrite m n f
iterFwdRw  :: FwdRewrite m n f -> FwdRewrite m n f
noFwdRw    :: Monad m => FwdRewrite m n f

------- Fact-like things, aka "fact(s)" -----
type family   Fact x f :: *
type instance Fact O f = f
type instance Fact C f = FactBase f

------- FactBase -------
type FactBase f = LabelMap f
  -- A finite mapping from Labels to facts f
mkFactBase :: DataflowLattice f -> [(Label, f)] -> FactBase f

------- Rolling back speculative rewrites ----
class Monad m => CkpointMonad m where
  type Checkpoint m
  checkpoint :: m (Checkpoint m)
  restart    :: Checkpoint m -> m ()

------- Optimization fuel ----
type Fuel = Int
class Monad m => FuelMonad m where
  getFuel :: m Fuel
  setFuel :: Fuel -> m ()

Figure 4. Hoopl API data types

Given a node type n and a FwdPass, a client can ask Hoopl to analyze and rewrite a graph. Hoopl provides a fully polymorphic interface, but for purposes of exposition, we present a function that is specialized to a closed/closed graph:

analyzeAndRewriteFwdBody
  :: ( CkpointMonad m  -- Roll back speculative actions
     , NonLocal n )    -- Extract non-local flow edges
  => FwdPass m n f     -- Lattice, transfer, rewrite
  -> [Label]           -- Entry point(s)
  -> Graph n C C       -- Input graph
  -> FactBase f        -- Input fact(s)
  -> m ( Graph n C C   -- Result graph
       , FactBase f )  -- ... and its facts

Given a FwdPass and a list of entry points, the analyze-and-rewrite function transforms a graph into an optimized graph. As its type shows, this function is polymorphic in the types of nodes n and facts f; these types are chosen by the client. The type of the monad m is also chosen by the client.

As well as taking and returning a graph, the function also takes input facts (the FactBase) and produces output facts. A FactBase is a finite mapping from Label to facts (Figure 4); if a Label is not in the domain of the FactBase, its fact is the bottom element of the lattice. For example, in our constant-propagation example from Section 2, if the graph represents the body of a procedure with parameters x, y, z, we would map the entry Label to a fact x=⊤ ∧ y=⊤ ∧ z=⊤, to specify that the procedure's parameters are not known to be constants.

The client's model of analyzeAndRewriteFwdBody is as follows: Hoopl walks forward over each block in the graph. At each node, Hoopl applies the rewrite function to the node and the incoming fact. If the rewrite function returns Nothing, the node is retained as part of the output graph, the transfer function is used to compute the outgoing fact, and Hoopl moves on to the next node. But if the rewrite function returns Just g, indicating that it wants to rewrite the node to the replacement graph g, Hoopl recursively analyzes and may further rewrite g before moving on to the next node. A node following a rewritten node sees up-to-date facts; that is, its input fact is computed by analyzing the replacement graph.

A rewrite function may take any action that is justified by the incoming fact. If further analysis invalidates the fact, Hoopl rolls back the action. Because graphs cannot be mutated, rolling back to the original graph is easy. But rolling back a rewrite function's monadic action requires cooperation from the client: the client must provide checkpoint and restart operations, which make m an instance of Hoopl's CkpointMonad class (Section 4.7).

Below we flesh out the interface to analyzeAndRewriteFwdBody, leaving the implementation for Section 5.
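As an illustration of the cooperation Hoopl asks for, here is a minimal sketch of ours (not from the paper): for a client whose monad is a pure state monad, a checkpoint can simply be the entire state.

import Control.Monad.State

instance CkpointMonad (State s) where
  type Checkpoint (State s) = s
  checkpoint = get   -- save the state at the point of speculation
  restart    = put   -- roll back every effect since the checkpoint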
4.1  Dataflow lattices

For each analysis or transformation, the client must define a type of dataflow facts. A dataflow fact often represents an assertion about a program point, but in general, dataflow analysis establishes properties of paths:

• An assertion about all paths to a program point is established by a forward analysis. For example the assertion "x = 3" at point P claims that variable x holds value 3 at P, regardless of the path by which P is reached.
• An assertion about all paths from a program point is established by a backward analysis. For example, the assertion "x is dead" at point P claims that no path from P uses variable x.

A set of dataflow facts must form a lattice, and Hoopl must know (a) the bottom element of the lattice and (b) how to take the least upper bound (join) of two elements. To ensure that analysis terminates, it is enough if every fact has a finite number of distinct facts above it, so that repeated joins eventually reach a fixed point.

In practice, joins are computed at labels. If f_old is the fact currently associated with a label L, and if a transfer function propagates a new fact f_new into label L, Hoopl replaces f_old with the join f_old ⊔ f_new. And Hoopl needs to know if f_old ⊔ f_new = f_old, because if not, the analysis has not reached a fixed point. The bottom element and join operation of a lattice of facts of type f are stored in a value of type DataflowLattice f (Figure 4). As noted in the previous paragraph, Hoopl needs to know when the result of a join is equal to the old fact. It is often easiest to answer this question while the join itself is being computed. By contrast, a post facto equality test on facts might cost almost as much as a join. For these reasons, Hoopl does not require a separate equality test on facts. Instead, Hoopl requires that fact_join return a ChangeFlag as well as the join. If the join is the same as the old fact, the ChangeFlag should be NoChange; if not, the ChangeFlag should be SomeChange.

To help clients create lattices and join functions, Hoopl includes functions and constructors that can extend a fact type f with top and bottom elements. In this paper, we use only type WithTop, which comes with value constructors that have these types:

PElem :: f -> WithTop f
Top   :: WithTop f

Hoopl provides combinators which make it easy to create join functions that use Top. The most useful is extendJoinDomain, which uses auxiliary types defined in Figure 4:

extendJoinDomain
  :: (OldFact f -> NewFact f -> (ChangeFlag, WithTop f))
  -> JoinFun (WithTop f)

A client supplies a join function that consumes only facts of type f, but may produce either Top or a fact of type f—as in the example of Figure 5 below. Calling extendJoinDomain extends the client's function to a proper join function on the type WithTop a, guaranteeing that joins involving Top obey the appropriate algebraic laws.

Hoopl also provides a value constructor Bot and type constructors WithBot and WithTopAndBot, along with similar functions. Constructors Top and Bot are polymorphic, so for example, Top also has type WithTopAndBot a.

It is also common to use a lattice that takes the form of a finite map. In such lattices it is typical to join maps pointwise, and Hoopl provides a function that makes it convenient to do so:

joinMaps :: Ord k => JoinFun f -> JoinFun (Map.Map k f)
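Figure 5 is not part of this excerpt, but a lattice for constant propagation along the lines just described might look as follows; this is a sketch assuming client types Var and Lit (with Ord and Eq instances), not code from the paper.

import qualified Data.Map as Map

-- A fact maps each variable either to a known constant (PElem) or to
-- Top, meaning "more than one value reaches this point."
type ConstFact = Map.Map Var (WithTop Lit)

constLattice :: DataflowLattice ConstFact
constLattice = DataflowLattice
  { fact_bot  = Map.empty
  , fact_join = joinMaps (extendJoinDomain constFactAdd) }
  where
    -- Joining two equal constants changes nothing; two different
    -- constants join to Top.
    constFactAdd (OldFact old) (NewFact new)
      | new == old = (NoChange,   PElem new)
      | otherwise  = (SomeChange, Top)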
4.2  The transfer function

A forward transfer function is presented with the dataflow fact coming into a node, and it computes dataflow fact(s) on the node's outgoing edge(s). In a forward analysis, Hoopl starts with the fact at the beginning of a block and applies the transfer function to successive nodes in that block, until eventually the transfer function for the last node computes the facts that are propagated to the block's successors. For example, consider doing constant propagation (Section 2) on the following graph, whose entry point is L1:

L1: x=3; goto L2
L2: y=x+4; x=x-1; if x>0 then goto L2 else return

Forward analysis starts with the bottom fact {} at every label except the entry L1. The initial fact at L1 is {x=⊤, y=⊤}. Analyzing L1 propagates this fact forward, applying the transfer function successively to the nodes of L1, and propagating the new fact {x=3, y=⊤} to L2. This new fact is joined with the existing (bottom) fact at L2. Now the analysis propagates L2's fact forward, again applying the transfer function, and propagating the new fact {x=2, y=7} to L2. Again the new fact is joined with the existing fact at L2, and the process repeats until the facts reach a fixed point.

A transfer function has an unusual sort of type: not quite a dependent type, but not a bog-standard polymorphic type either. The result type of the transfer function is indexed by the shape (i.e., the type) of the node argument: If the node is open on exit, the transfer function produces a single fact. But if the node is closed on exit, the transfer function produces a collection of (Label, fact) pairs: one for each outgoing edge. The collection is represented by a FactBase; auxiliary function mkFactBase (Figure 4) joins facts on distinct outgoing edges that target the same label. The indexing is expressed by Haskell's (recently added) indexed type families. A forward transfer function supplied by a client, which is passed to mkFTransfer, is polymorphic in e and x (Figure 4). It takes a node of type n e x, and it returns a fact transformer of type f -> Fact x f. Type constructor Fact is a species of type-level function: its signature is given in the type family declaration, and its definition is given by two type instance declarations. The first declaration says that a Fact O f, which comes out of a node open on exit, is just a fact f. The second declaration says that a Fact C f, which comes out of a node closed on exit, is a mapping from Label to facts.
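Continuing the sketch above (our illustration, assuming the Node type of Figure 1 and a hypothetical Lit constructor in the client's Expr type), a transfer function for constant propagation shows the shape-indexed result type in action:

constTransfer :: FwdTransfer Node ConstFact
constTransfer = mkFTransfer tf
  where
    tf :: Node e x -> ConstFact -> Fact x ConstFact
    tf (Label _)          f = f                         -- open on exit
    tf (Assign v (Lit k)) f = Map.insert v (PElem k) f  -- known constant
    tf (Assign v _)       f = Map.insert v Top f        -- unknown value
    tf (Store _ _)        f = f
    tf (Branch l)         f = mkFactBase constLattice [(l, f)]
    tf (Cond _ tl fl)     f = mkFactBase constLattice [(tl, f), (fl, f)]
    -- (remaining Node constructors elided, as in Figure 1)

Nodes open on exit return a plain fact; nodes closed on exit return a FactBase, exactly as dictated by the Fact type family.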
we might want to remove a node by returning the empty graph, or more ambitiously, we might want to replace a high-level operation with a tree of conditional branches or a loop, which would entail returning a graph containing new blocks with internal control flow.
Appealing to this model, we see that • A function mkFRewrite rw never rewrites a replacement graph;
this behavior is shallow rewriting. • When a function r1 ‘thenFwdRw‘ r2 is applied to a node,
It must also be possible for a rewrite function to decide to do nothing. The result of the monadic computation returned by r may therefore be Nothing, indicating that the node should not be rewritten, or Just g, indicating that the node should be replaced with g: the replacement graph.
The type of mkFRewrite in Figure 4 guarantees that the replacement graph g has the same shape as the node being rewritten. For example, a branch instruction can be replaced only by a graph closed on exit.

4.4 Shallow rewriting, deep rewriting, rewriting combinators, and the meaning of FwdRewrite

When a node is rewritten, the replacement graph g must itself be analyzed, and its nodes may be further rewritten. Hoopl can make a recursive call to analyzeAndRewriteFwdBody—but how should it rewrite the replacement graph g? There are two common cases:

• Rewrite g using the same rewrite function that produced g. This procedure is called deep rewriting. When deep rewriting is used, the client’s rewrite function must ensure that the graphs it produces are not rewritten indefinitely (Section 4.8).

• Analyze g without rewriting it. This procedure is called shallow rewriting.

Deep rewriting is essential to achieve the full benefits of interleaved analysis and transformation (Lerner, Grove, and Chambers 2002). But shallow rewriting can be vital as well; for example, a backward dataflow pass that inserts a spill before a call must not rewrite the call again, lest it attempt to insert infinitely many spills.

An innovation of Hoopl is to build the choice of shallow or deep rewriting into each rewrite function, through the use of the four combinators mkFRewrite, thenFwdRw, iterFwdRw, and noFwdRw shown in Figure 4. Every rewrite function is made with these combinators, and its behavior is characterized by the answers to two questions: Does the function rewrite a node to a replacement graph? If so, what rewrite function should be used to analyze the replacement graph recursively? To answer these questions, we present an algebraic datatype that models FwdRewrite with one constructor for each combinator:

  data Rw r = Mk r | Then (Rw r) (Rw r) | Iter (Rw r) | No

Using this model, we specify how a rewrite function works by giving a reference implementation: the function rewrite, below, computes the replacement graph and rewrite function that result from applying a rewrite function r to a node and a fact f. The code is in continuation-passing style; when the node is rewritten, the first continuation j accepts a pair containing the replacement graph and the new rewrite function to be used to transform it. When the node is not rewritten, the second continuation n is the (lazily evaluated) result.

  rewrite :: Monad m => FwdRewrite m n f -> n e x -> f
          -> m (Maybe (Graph n e x, FwdRewrite m n f))
  rewrite r node f = rew r (return . Just) (return Nothing)
    where
      rew (Mk rw) j n = do { mg <- rw node f
                           ; case mg of Nothing -> n
                                        Just g  -> j (g, No) }
      rew (r1 ‘Then‘ r2) j n = rew r1 (j . add r2) (rew r2 j n)
      rew (Iter r)       j n = rew r (j . add (Iter r)) n
      rew No             j n = n
      add nextrw (g, r) = (g, r ‘Then‘ nextrw)

Appealing to this model, we see that

• A function mkFRewrite rw never rewrites a replacement graph; this behavior is shallow rewriting.

• When a function r1 ‘thenFwdRw‘ r2 is applied to a node, if r1 replaces the node, then r2 is used to transform the replacement graph. And if r1 does not replace the node, Hoopl tries r2.

• When a function iterFwdRw r rewrites a node, iterFwdRw r is used to transform the replacement graph; this behavior is deep rewriting. If r does not rewrite a node, neither does iterFwdRw r.

• Finally, noFwdRw never replaces a graph.

For convenience, we also provide the function deepFwdRw, which is the composition of iterFwdRw and mkFRewrite.

Our combinators satisfy the algebraic laws that you would expect; for example, noFwdRw is a left and right identity of thenFwdRw. A more interesting law is

  iterFwdRw r = r ‘thenFwdRw‘ iterFwdRw r

Unfortunately, this law cannot be used to define iterFwdRw: if we used this law to define iterFwdRw, then when r returned Nothing, iterFwdRw r would diverge.

4.5 When the type of nodes is not known

We note above (Section 4.2) that the type of a transfer function’s result depends on the argument’s shape on exit. It is easy for a client to write a type-indexed transfer function, because the client defines the constructor and shape for each node. The client’s transfer functions discriminate on the constructor and so can return a result that is indexed by each node’s shape.

What if you want to write a transfer function that does not know the type of the node? For example, a dominator analysis need not scrutinize nodes; it needs to know only about labels and edges in the graph. Ideally, a dominator analysis would work with any type of node n, provided only that n is an instance of the NonLocal type class. But if we don’t know the type of n, we can’t write a function of type n e x -> f -> Fact x f, because the only way to get the result type right is to scrutinize the constructors of n.

There is another way; in place of a single function that is polymorphic in shape, Hoopl also accepts a triple of functions, each of which is polymorphic in the node’s type but monomorphic in its shape:

  mkFTransfer3 :: (n C O -> f -> Fact O f)
               -> (n O O -> f -> Fact O f)
               -> (n O C -> f -> Fact C f)
               -> FwdTransfer n f

We have used this interface to write a number of functions that are polymorphic in the node type n:

• A function that takes a FwdTransfer and wraps it in logging code, so an analysis can be debugged by watching facts flow through nodes

• A pairing function that runs two passes interleaved, not sequentially, potentially producing better results than any sequence:

    pairFwd :: Monad m => FwdPass m n f -> FwdPass m n f’ -> FwdPass m n (f, f’)

• An efficient dominator analysis in the style of Cooper, Harvey, and Kennedy (2001), whose transfer function is implemented using only the functions in the NonLocal type class (a sketch follows this list)
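To make the shape-monomorphic interface concrete, here is a hedged sketch of a transfer function that never scrutinizes node constructors, in the spirit of the dominator analysis just mentioned. Doms and extend are invented stand-ins for a dominator fact and the operation that extends it with a newly entered label; entryLabel, successors, and mkFactBase are used as described elsewhere in this section:

  domTransfer :: NonLocal n => DataflowLattice Doms -> FwdTransfer n Doms
  domTransfer lat = mkFTransfer3 first middle last
    where
      first  n f = extend f (entryLabel n)  -- entering a block: only its label matters
      middle _ f = f                        -- interior nodes cannot affect dominators
      last   n f = mkFactBase lat [ (l, f) | l <- successors n ]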
4.6 Example: Constant propagation and constant folding

Figure 5 shows client code for constant propagation and constant folding. For each variable, at each program point, the analysis concludes one of three facts: the variable holds a constant value of type Lit, the variable might hold a non-constant value, or what the variable holds is unknown. We represent these facts using a finite map from a variable to a fact of type WithTop Lit (Section 4.1). A variable with a constant value maps to Just (PElem k), where k is the constant value; a variable with a non-constant value maps to Just Top; and a variable with an unknown value maps to Nothing (it is not in the domain of the finite map).

  -- Type and definition of the lattice
  type ConstFact = Map.Map Var (WithTop Lit)
  constLattice :: DataflowLattice ConstFact
  constLattice = DataflowLattice
    { fact_bot  = Map.empty
    , fact_join = joinMaps (extendJoinDomain constFactAdd) }
    where
      constFactAdd _ (OldFact old) (NewFact new)
        = if new == old then (NoChange,   PElem new)
                        else (SomeChange, Top)

The definition of the lattice (constLattice) is straightforward. The bottom element is an empty map (nothing is known about what any variable holds). The join function is implemented with the help of combinators provided by Hoopl. The client writes a simple function, constFactAdd, which compares two values of type Lit and returns a result of type WithTop Lit. The client uses extendJoinDomain to lift constFactAdd into a join function on WithTop Lit, then uses joinMaps to lift that join function up to the map containing facts for all variables.

  -- Analysis: variable equals a literal constant
  varHasLit :: FwdTransfer Node ConstFact
  varHasLit = mkFTransfer ft
    where
      ft :: Node e x -> ConstFact -> Fact x ConstFact
      ft (Label _)            f = f
      ft (Assign x (Lit k))   f = Map.insert x (PElem k) f
      ft (Assign x _)         f = Map.insert x Top f
      ft (Store _ _)          f = f
      ft (Branch l)           f = mapSingleton l f
      ft (Cond (Var x) tl fl) f =
        mkFactBase constLattice
          [(tl, Map.insert x (PElem (Bool True))  f),
           (fl, Map.insert x (PElem (Bool False)) f)]
      ft (Cond _ tl fl)       f =
        mkFactBase constLattice [(tl, f), (fl, f)]

The forward transfer function varHasLit is defined using the shape-polymorphic auxiliary function ft. For most nodes n, ft n simply propagates the input fact forward. But for an assignment node, if a variable x gets a constant value k, ft extends the input fact by mapping x to PElem k. And if a variable x is assigned a non-constant value, ft extends the input fact by mapping x to Top. There is one other interesting case: a conditional branch where the condition is a variable. If the conditional branch flows to the true successor, the variable holds True, and similarly for the false successor, mutatis mutandis. Function ft updates the fact flowing to each successor accordingly. Because ft scrutinizes a GADT, it cannot use a wildcard to default the uninteresting cases.

The transfer function need not consider complicated cases such as an assignment x:=y where y holds a constant value k. Instead, we rely on the interleaving of transformation and analysis to first transform the assignment to x:=k, which is exactly what our simple transfer function expects. As we mention in Section 2, interleaving makes it possible to write very simple transfer functions without missing opportunities to improve the code.

  -- Rewriting: replace constant variables
  constProp :: FuelMonad m => FwdRewrite m Node ConstFact
  constProp = mkFRewrite cp
    where
      cp node f  = return $ liftM nodeToG $ mapVN (lookup f) node
      mapVN      = mapEN . mapEE . mapVE
      lookup f x = case Map.lookup x f of
                     Just (PElem v) -> Just $ Lit v
                     _              -> Nothing

Figure 5’s rewrite function for constant propagation, constProp, rewrites each use of a variable to its constant value. The client has defined auxiliary functions that may change expressions or nodes:

  type MaybeChange a = a -> Maybe a
  mapVE :: (Var -> Maybe Expr) -> MaybeChange Expr
  mapEE :: MaybeChange Expr    -> MaybeChange Expr
  mapEN :: MaybeChange Expr    -> MaybeChange (Node e x)
  mapVN :: (Var -> Maybe Expr) -> MaybeChange (Node e x)
  nodeToG :: Node e x -> Graph Node e x

The client composes mapXX functions to apply lookup to each use of a variable in each kind of node; lookup substitutes for each variable that has a constant value. Applying liftM nodeToG lifts the final node, if present, into a Graph.

  -- Simplification ("constant folding")
  simplify :: FuelMonad m => FwdRewrite m Node f
  simplify = deepFwdRw simp
    where
      simp node _ = return $ liftM nodeToG $ s_node node
      s_node :: Node e x -> Maybe (Node e x)
      s_node (Cond (Lit (Bool b)) t f) = Just $ Branch (if b then t else f)
      s_node n = (mapEN . mapEE) s_exp n
      s_exp (Binop Add (Lit (Int n1)) (Lit (Int n2)))
              = Just $ Lit $ Int $ n1 + n2
      -- ... more cases for constant folding

Figure 5 also gives another, distinct function for constant folding: simplify. This function rewrites constant expressions to their values, and it rewrites a conditional branch on a boolean constant to an unconditional branch. To rewrite constant expressions, it runs s_exp on every subexpression. Function simplify does not check whether a variable holds a constant value; it relies on constProp to have replaced the variable by the constant. Indeed, simplify does not consult the incoming fact, so it is polymorphic in f.

  -- Defining the forward dataflow pass
  constPropPass = FwdPass
    { fp_lattice  = constLattice
    , fp_transfer = varHasLit
    , fp_rewrite  = constProp ‘thenFwdRw‘ simplify }

Figure 5. The client for constant propagation and constant folding (extracted automatically from code distributed with Hoopl)

The FwdRewrite functions constProp and simplify are useful independently. In this case, however, we want both of them, so we compose them with thenFwdRw. The composition, along with the lattice and the transfer function, goes into constPropPass (bottom of Figure 5). Given constPropPass, we can improve a graph g by passing constPropPass and g to analyzeAndRewriteFwdBody.
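As a usage sketch, the call might look roughly as follows; the precise argument list of analyzeAndRewriteFwdBody is given in Figure 4, which is not reproduced here, so the arguments below (a pass, entry labels, the graph, and an initial FactBase) are an assumption based on the descriptions above:

  -- Run the pass on graph g, starting with no information at the entry.
  optimise entry g =
    analyzeAndRewriteFwdBody constPropPass [entry] g
                             (mapSingleton entry Map.empty)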
4.7 Checkpointing the client’s monad

When analyzing a program with loops, a rewrite function could make a change that later has to be rolled back. For example, consider constant propagation in this loop, which computes factorial:

  i = 1; prod = 1;
  L1: if (i >= n) goto L3 else goto L2;
  L2: i = i + 1; prod = prod * i;
      goto L1;
  L3: ...

Function analyzeAndRewriteFwdBody iterates through this graph until the dataflow facts stop changing. On the first iteration, the assignment i = i + 1 is analyzed with an incoming fact i=1, and the assignment is rewritten to the graph i = 2. But on a later iteration, the incoming fact increases to i=⊤, and the rewrite is no longer justified. After each iteration, Hoopl starts the next iteration with new facts but with the original graph—by virtue of using purely functional data structures, rewrites from previous iterations are automatically rolled back.

But a rewrite function doesn’t only produce new graphs; it can also take monadic actions, such as acquiring a fresh name. These actions must also be rolled back, and because the client chooses the monad in which the actions take place, the client must provide the means to roll back the actions. Hoopl therefore defines a rollback interface, which each client must implement; it is the type class CkpointMonad from Figure 4:

  class Monad m => CkpointMonad m where
    type Checkpoint m
    checkpoint :: m (Checkpoint m)
    restart    :: Checkpoint m -> m ()

Hoopl calls the checkpoint method at the beginning of an iteration, then calls the restart method if another iteration is necessary. These operations must obey the following algebraic law:

  do { s <- checkpoint; m; restart s } == return ()

where m represents any combination of monadic actions that might be taken by rewrite functions. (The safest course is to make sure the law holds for any action in the monad.) The type of the saved checkpoint s is up to the client; it is specified as an associated type of the CkpointMonad class.
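For instance, a client whose monad is a simple state monad carrying a fresh-name supply can satisfy the law by saving and restoring that supply. This is an invented example (the monad SimpleM is not part of Hoopl, and the Functor/Applicative boilerplate that modern GHC requires is omitted):

  newtype SimpleM a = SimpleM { runSimpleM :: Int -> (a, Int) }  -- state: next fresh name

  instance Monad SimpleM where
    return a = SimpleM (\s -> (a, s))
    m >>= k  = SimpleM (\s -> let (a, s') = runSimpleM m s in runSimpleM (k a) s')

  instance CkpointMonad SimpleM where
    type Checkpoint SimpleM = Int
    checkpoint = SimpleM (\s -> (s, s))      -- save the name supply
    restart s  = SimpleM (\_ -> ((), s))     -- roll it back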
4.8 Correctness

Facts computed by the transfer function depend on graphs produced by the rewrite function, which in turn depend on facts computed by the transfer function. How do we know this algorithm is sound, or if it terminates? A proof requires a POPL paper (Lerner, Grove, and Chambers 2002); here we merely state the conditions for correctness as applied to Hoopl:

• The lattice must have no infinite ascending chains; that is, every sequence of calls to fact_join must eventually return NoChange.

• The transfer function must be monotonic: given a more informative fact in, it must produce a more informative fact out.

• The rewrite function must be sound: if it replaces a node n by a replacement graph g, then g must be observationally equivalent to n under the assumptions expressed by the incoming dataflow fact f. Moreover, analysis of g must produce output fact(s) that are at least as informative as the fact(s) produced by applying the transfer function to n. For example, if the transfer function says that x=7 after the node n, then after analysis of g, x had better still be 7.

• A transformation that uses deep rewriting must not return a replacement graph which contains a node that could be rewritten indefinitely.

Under these conditions, the algorithm terminates and is sound.

5. Hoopl’s implementation

Section 4 gives a client’s-eye view of Hoopl, showing how to create analyses and transformations. Hoopl’s interface is simple, but the implementation of interleaved analysis and rewriting is not. Lerner, Grove, and Chambers (2002) do not describe their implementation. We have written at least three previous implementations, all of which were long and hard to understand, and only one of which provided compile-time guarantees about open and closed shapes. We are not confident that any of these implementations are correct.

In this paper we describe a new implementation. It is elegant and short (about a third of the size of our last attempt), and it offers strong compile-time guarantees about shapes. We describe only the implementation of forward analysis and transformation. The implementations of backward analysis and transformation are exactly analogous and are included in Hoopl.

We also explain, in Section 5.5, how we isolate errors in faulty optimizers, and how the fault-isolation machinery is integrated with the rest of the implementation.

5.1 Overview

Instead of the interface function analyzeAndRewriteFwdBody, we present the more polymorphic, private function arfGraph, which is short for “analyze and rewrite forward graph”:

  arfGraph :: forall m n f e x. (CkpointMonad m, NonLocal n)
           => FwdPass m n f     -- lattice, transfers, rewrites
           -> MaybeC e [Label]  -- entry points for a closed graph
           -> Graph n e x       -- the original graph
           -> Fact e f          -- fact(s) flowing into entry/entries
           -> m (DG f n e x, Fact x f)

Function arfGraph has a more general type than the function analyzeAndRewriteFwdBody because arfGraph is used recursively to analyze graphs of all shapes. If a graph is closed on entry, a list of entry points must be provided; if the graph is open on entry, the graph’s entry sequence must be the only entry point. The graph’s shape on entry also determines the type of fact or facts flowing in. Finally, the result is a “decorated graph” DG f n e x, and if the graph is open on exit, an “exit fact” flowing out.

A “decorated graph” is one in which each block is decorated with the fact that holds at the start of the block. DG actually shares a representation with Graph, which is possible because the definition of Graph in Figure 2 contains a white lie: Graph is a type synonym for an underlying type Graph’, which takes the type of block as an additional parameter. (Similarly, function gSplice in Section 3.4 is actually a higher-order function that takes a block-concatenation function as a parameter.) The truth about Graph and DG is as follows:

  type Graph = Graph’ Block
  type DG f  = Graph’ (DBlock f)
  data DBlock f n e x = DBlock f (Block n e x)

Type DG is internal to Hoopl; it is not seen by any client. To convert a DG to the Graph and FactBase that are returned by the API function analyzeAndRewriteFwdBody, we use a 12-line function:

  normalizeGraph :: NonLocal n => DG f n e x -> (Graph n e x, FactBase f)
Function arfGraph is implemented as follows:

  arfGraph pass entries = graph
    where
      node  :: forall e x . (ShapeLifter e x)
            => n e x -> f -> m (DG f n e x, Fact x f)
      block :: forall e x . Block n e x -> f -> m (DG f n e x, Fact x f)
      body  :: [Label] -> LabelMap (Block n C C)
            -> Fact C f -> m (DG f n C C, Fact C f)
      graph :: Graph n e x -> Fact e f -> m (DG f n e x, Fact x f)
      ... definitions of ’node’, ’block’, ’body’, and ’graph’ ...

The four auxiliary functions help us separate concerns: for example, only node knows about rewrite functions, and only body knows about fixed points. Each auxiliary function works the same way: it takes a “thing” and returns an extended fact transformer. An extended fact transformer takes dataflow fact(s) coming into the “thing,” and it returns an output fact. It also returns a decorated graph representing the (possibly rewritten) “thing”—that’s the extended part. Finally, because rewrites are monadic, every extended fact transformer is monadic. The types of the extended fact transformers are not quite identical:

• Extended fact transformers for nodes and blocks have the same type; like forward transfer functions, they expect a fact f rather than the more general Fact e f required for a graph. Because a node or a block has exactly one fact flowing into the entry, it is easiest simply to pass that fact.

• Extended fact transformers for bodies have the same type as extended fact transformers for closed/closed graphs.

• Extended fact transformers for graphs have the most general type, as expressed using Fact: if the graph is open on entry, its fact transformer expects a single fact; if the graph is closed on entry, its fact transformer expects a FactBase.

Function arfGraph and its four auxiliary functions comprise a cycle of mutual recursion: arfGraph calls graph; graph calls body and block; body calls block; block calls node; and node calls arfGraph. These five functions do three different kinds of work: compose extended fact transformers, analyze and rewrite nodes, and compute fixed points.

5.2 Analyzing blocks and graphs by composing extended fact transformers

Extended fact transformers compose nicely. For example, block is implemented thus:

  block :: forall e x . Block n e x -> f -> m (DG f n e x, Fact x f)
  block (BFirst  n)  = node n
  block (BMiddle n)  = node n
  block (BLast   n)  = node n
  block (BCat b1 b2) = block b1 ‘cat‘ block b2

The composition function cat feeds facts from one extended fact transformer to another, and it splices decorated graphs:

  cat :: forall e a x f1 f2 f3.
         (f1 -> m (DG f n e a, f2))
      -> (f2 -> m (DG f n a x, f3))
      -> (f1 -> m (DG f n e x, f3))
  cat ft1 ft2 f = do { (g1, f1) <- ft1 f
                     ; (g2, f2) <- ft2 f1
                     ; return (g1 ‘dgSplice‘ g2, f2) }

(Function dgSplice is the same splicing function used for an ordinary Graph, but it uses a one-line block-concatenation function suitable for DBlocks.) The name cat comes from the concatenation of the decorated graphs, but it is also appropriate because the style in which it is used is reminiscent of concatMap, with the node and block functions playing the role of map.

Function graph is much like block, but it has more cases.

5.3 Analyzing and rewriting nodes

The node function is where we interleave analysis with rewriting:

  node :: forall e x . (ShapeLifter e x)
       => n e x -> f -> m (DG f n e x, Fact x f)
  node n f = do { grw <- frewrite pass n f
                ; case grw of
                    Nothing      -> return ( singletonDG f n
                                           , ftransfer pass n f )
                    Just (g, rw) ->
                      let pass’ = pass { fp_rewrite = rw }
                          f’    = fwdEntryFact n f
                      in  arfGraph pass’ (fwdEntryLabel n) g f’ }

Function node uses frewrite to extract the rewrite function from pass, and it applies that rewrite function to node n and incoming fact f. The result, grw, is scrutinized by the case expression.

In the Nothing case, no rewrite takes place. We return node n and its incoming fact f as the decorated graph singletonDG f n. To produce the outgoing fact, we apply the transfer function ftransfer pass to n and f.

In the Just case, we receive a replacement graph g and a new rewrite function rw, as specified by the model in Section 4.4. We use rw to analyze and rewrite g recursively with arfGraph. The recursive analysis uses a new pass pass’, which contains the original lattice and transfer function from pass, together with rw. Function fwdEntryFact converts fact f from the type f, which node has, to the type Fact e f, which arfGraph expects.

As shown above, several functions called in node are overloaded over a (private) class ShapeLifter. Their implementations depend on the open/closed shape of the node. By design, the shape of a node is known statically everywhere node is called, so this use of ShapeLifter is specialized away by the compiler.

  class ShapeLifter e x where
    singletonDG   :: f -> n e x -> DG f n e x
    fwdEntryFact  :: NonLocal n => n e x -> f -> Fact e f
    fwdEntryLabel :: NonLocal n => n e x -> MaybeC e [Label]
    ftransfer     :: FwdPass m n f -> n e x -> f -> Fact x f
    frewrite      :: FwdPass m n f -> n e x -> f
                  -> m (Maybe (Graph n e x, FwdRewrite m n f))

5.4 Fixed points

The fourth and final auxiliary function of arfGraph is body, which iterates to a fixed point. This part of the implementation is the only really tricky part, and it is cleanly separated from everything else:

  body :: [Label] -> LabelMap (Block n C C)
       -> Fact C f -> m (DG f n C C, Fact C f)
  body entries blockmap init_fbase
    = fixpoint Fwd lattice do_block blocks init_fbase
    where
      blocks  = forwardBlockList entries blockmap
      lattice = fp_lattice pass
      do_block b fb = block b entryFact
        where entryFact = getFact lattice (entryLabel b) fb

Function getFact looks up a fact by its label. If the label is not found, getFact returns the bottom element of the lattice:

  getFact :: DataflowLattice f -> Label -> FactBase f -> f
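A plausible one-line implementation, assuming a lookup function on FactBase (the name lookupFact is an assumption here, not quoted from Hoopl):

  getFact lat l fb = case lookupFact l fb of
                       Just f  -> f
                       Nothing -> fact_bot lat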
Function forwardBlockList takes a list of possible entry points and a finite map from labels to blocks. It returns a list of blocks, sorted into an order that makes forward dataflow efficient.²

  forwardBlockList :: NonLocal n
                   => [Label] -> LabelMap (Block n C C) -> [Block n C C]

For example, if the entry point is at L2, and the block at L2 branches to L1, but not vice versa, then Hoopl will reach a fixed point more quickly if we process L2 before L1. To find an efficient order, forwardBlockList uses the methods of the NonLocal class—entryLabel and successors—to perform a reverse postorder depth-first traversal of the control-flow graph.

² The order of the blocks does not affect the fixed point or any other result; it affects only the number of iterations needed to reach the fixed point.

The rest of the work is done by fixpoint, which is shared by both forward and backward analyses:

  data Direction = Fwd | Bwd
  fixpoint :: forall m n f. (CkpointMonad m, NonLocal n)
           => Direction
           -> DataflowLattice f
           -> (Block n C C -> Fact C f -> m (DG f n C C, Fact C f))
           -> [Block n C C]
           -> (Fact C f -> m (DG f n C C, Fact C f))
Except for the Direction passed as the first argument, the type signature tells the story. The third argument can produce an extended fact transformer for any single block; fixpoint applies it successively to each block in the list passed as the fourth argument. Function fixpoint returns an extended fact transformer for the list.

The extended fact transformer returned by fixpoint maintains a “current FactBase” which grows monotonically: as each block is analyzed, the block’s input fact is taken from the current FactBase, and the current FactBase is augmented with the facts that flow out of the block. The initial value of the current FactBase is the input FactBase, and the extended fact transformer iterates over the blocks until the current FactBase stops changing.

Implementing fixpoint requires about 90 lines, formatted for narrow display. The code, which is appended to the Web version of this paper (http://bit.ly/cZ7ts1), is mostly straightforward—although we try to be clever about deciding when a new fact means that another iteration is required. There is one more subtle point worth mentioning, which we highlight by considering a forward analysis of this graph, where execution starts at L1:

  L1: x:=3; goto L4
  L2: x:=4; goto L4
  L4: if x>3 goto L2 else goto L5

Block L2 is unreachable. But if we naïvely process all the blocks (say in order L1, L4, L2), then we will start with the bottom fact for L2, propagate {x=4} to L4, where it will join with {x=3} to yield {x=⊤}. Given x=⊤, the conditional in L4 cannot be rewritten, and L2 seems reachable. We have lost a good optimization.

Function fixpoint solves this problem by analyzing a block only if the block is reachable from an entry point. This trick is safe only for a forward analysis, which is why fixpoint takes a Direction as its first argument.

5.5 Throttling rewriting using “optimization fuel”

When optimization produces a faulty program, we use Whalley’s (1994) technique to find the fault: given a program that fails when compiled with optimization, a binary search on the number of rewrites finds an n such that the program works after n − 1 rewrites but fails after n rewrites. The nth rewrite is faulty. As alluded to at the end of Section 2, this technique enables us to debug complex optimizations by identifying one single rewrite that is faulty.

To use this debugging technique, we must be able to control the number of rewrites. We limit rewrites using optimization fuel. Each rewrite consumes one unit of fuel, and when fuel is exhausted, all rewrite functions return Nothing. To debug, we do binary search on the amount of fuel.

The supply of fuel is encapsulated in the FuelMonad type class (Figure 4), which must be implemented by the client’s monad m. To ensure that each rewrite consumes one unit of fuel, mkFRewrite wraps the client’s rewrite function, which must be oblivious to fuel, in another function that satisfies the following contract (a sketch follows this list):

• If the fuel supply is empty, the wrapped function always returns Nothing.

• If the wrapped function returns Just g, it has the monadic effect of reducing the fuel supply by one unit.
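A hedged sketch of how such a wrapper might look, assuming FuelMonad provides getFuel and setFuel (names mentioned in Section 8 below) and that fuel is counted as an Int; the function withFuel itself is invented for illustration:

  withFuel :: FuelMonad m => m (Maybe a) -> m (Maybe a)
  withFuel rw = do
    fuel <- getFuel
    if fuel == 0
      then return Nothing              -- out of fuel: rewrite nothing
      else do mg <- rw
              case mg of
                Nothing -> return Nothing
                Just g  -> do setFuel (fuel - 1)  -- a successful rewrite costs one unit
                              return (Just g)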
6. Related work

While there is a vast body of literature on dataflow analysis and optimization, relatively little can be found on the design of optimizers, which is the topic of this paper. We therefore focus on the foundations of dataflow analysis and on the implementations of some comparable dataflow frameworks.

Foundations. When transfer functions are monotone and lattices are finite in height, iterative dataflow analysis converges to a fixed point (Kam and Ullman 1976). If the lattice’s join operation distributes over transfer functions, this fixed point is equivalent to a join-over-all-paths solution to the recursive dataflow equations (Kildall 1973).³ Kam and Ullman (1977) generalize to some monotone functions. Each client of Hoopl must guarantee monotonicity.

³ Kildall uses meets, not joins. Lattice orientation is a matter of convention, and conventions have changed. We use Dana Scott’s orientation, in which higher elements carry more information.

Cousot and Cousot (1977, 1979) introduce abstract interpretation as a technique for developing lattices for program analysis. Steffen (1991) shows that a dataflow analysis can be implemented using model checking; Schmidt (1998) expands on this result by showing that an all-paths dataflow problem can be viewed as model checking an abstract interpretation.

Marlowe and Ryder (1990) present a survey of different methods for performing dataflow analyses, with emphasis on theoretical results. Muchnick (1997) presents many examples of both particular analyses and related algorithms.

Lerner, Grove, and Chambers (2002) show that interleaving analysis and transformation is sound, even when not all speculative transformations are performed on later iterations.

Frameworks. Most dataflow frameworks support only analysis, not transformation. The framework computes a fixed point of transfer functions, and it is up to the client of the framework to use that fixed point for transformation. Omitting transformation makes it much easier to build frameworks, and one can find a spectrum of designs. We describe two representative designs, then move on to frameworks that do interleave analysis and transformation.

The Soot framework is designed for analysis of Java programs (Vallée-Rai et al. 2000). While Soot’s dataflow library supports only analysis, not transformation, we found much to admire in its design. Soot’s library is abstracted over the representation of the control-flow graph and the representation of instructions. Soot’s interface for defining lattice and analysis functions is like our own, although because Soot is implemented in an imperative style, additional functions are needed to copy lattice elements.

The CIL toolkit (Necula et al. 2002) supports both analysis and rewriting of C programs, but rewriting is clearly distinct from analysis: one runs an analysis to completion and then rewrites based on the results. The framework is limited to one representation of control-flow graphs and one representation of instructions, both of which are mandated by the framework. The API is complicated; much of the complexity is needed to enable the client to affect which instructions the analysis iterates over.

The Whirlwind compiler contains the dataflow framework implemented by Lerner, Grove, and Chambers (2002), who were the first to interleave analysis and transformation. Their implementation is much like our early efforts: it is a complicated mix of code that simultaneously manages interleaving, deep rewriting, and fixed-point computation. By separating these tasks, our implementation simplifies the problem dramatically. Whirlwind’s implementation also suffers from the difficulty of maintaining pointer invariants in a mutable representation of control-flow graphs, a problem we have discussed elsewhere (Ramsey and Dias 2005).

Because speculative transformation is difficult in an imperative setting, Whirlwind’s implementation is split into two phases. The first phase runs the interleaved analyses and transformations to compute the final dataflow facts and a representation of the transformations that should be applied to the input graph. The second phase executes the transformations. In Hoopl, because control-flow graphs are immutable, speculative transformations can be applied immediately, and there is no need for a phase distinction.

7. Performance considerations

Our work on Hoopl is too new for us to be able to say much about performance. It is important to know how well Hoopl performs, but the question is comparative, and there isn’t another library we can compare Hoopl with. For example, Hoopl is not a drop-in replacement for an existing component of GHC; we introduced Hoopl to GHC as part of a major refactoring of GHC’s back end. With Hoopl, GHC seems about 15% slower than the previous GHC, but we don’t know what part of the slowdown, if any, should be attributed to the optimizer. We can say that the costs of using Hoopl seem reasonable; there is no “big performance hit.” And a somewhat similar library, written in an impure functional language, actually improved performance in an apples-to-apples comparison with a library using a mutable control-flow graph (Ramsey and Dias 2005).

Although thorough evaluation of Hoopl’s performance must await future work, we can identify some design decisions that might affect performance.

• In Figure 2, we show a single concatenation operator for blocks. Using this representation, a block of N nodes is represented using 2N − 1 heap objects. We have also implemented a representation of blocks that includes “cons-like” and “snoc-like” constructors (sketched after this list); this representation requires only N + 1 heap objects. We don’t know how this choice affects performance.

• In Section 5, the body function analyzes and (speculatively) rewrites the body of a control-flow graph, and fixpoint iterates this analysis until it reaches a fixed point. Decorated graphs computed on earlier iterations are thrown away. For each decorated graph of N nodes, at least 2N − 1 thunks are allocated; they correspond to applications of singletonDG in node and of dgSplice in cat. In an earlier version of Hoopl, this overhead was eliminated by splitting arfGraph into two phases, as in Whirlwind. The single arfGraph is simpler and easier to maintain; we don’t know if the extra thunks matter.

• The representation of a forward-transfer function is private to Hoopl. Two representations are possible: we may store a triple of functions, one for each shape a node may have; or we may store a single, polymorphic function. Hoopl uses triples, because although working with triples makes some code slightly more complex, the costs are straightforward. If we used a single, polymorphic function, we would have to use a shape classifier (supplied by the client) when composing transfer functions. Using a shape classifier would introduce extra case discriminations every time we applied a transfer function or rewrite function to a node. We don’t know how these extra discriminations might affect performance.
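A hedged sketch of such a cons/snoc-style block type; the constructor names are invented, and only the shape discipline (a closed-open head, open-open middles, an open-closed tail) is intended to match the description in the first bullet above:

  data Block' n e x where
    BOne  :: n O O -> Block' n O O                    -- a single middle node
    BCons :: n O O -> Block' n O O -> Block' n O O    -- cons a middle node
    BHead :: n C O -> Block' n O O -> Block' n C O    -- attach the entry node
    BTail :: Block' n O O -> n O C -> Block' n O C    -- attach the exit node
    BFull :: n C O -> Block' n O O -> n O C -> Block' n C C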
In summary, Hoopl performs well enough for use in GHC, but there is much we don’t know. We have no evidence that any of the decisions above measurably affects performance—systematic investigation is indicated.

8. Discussion

We built Hoopl in order to combine three good ideas (interleaved analysis and transformation, an applicative control-flow graph, and optimization fuel) in a way that could easily be reused by many compiler writers. To evaluate how well we succeeded, we examine how Hoopl has been used, we examine the API, and we examine the implementation. We also sketch one of the many alternatives we have implemented.

Using Hoopl. As suggested by the constant-propagation example in Figure 5, Hoopl makes it easy to implement many standard dataflow analyses. Students using Hoopl in a class at Tufts were able to implement such optimizations as lazy code motion (Knoop, Ruething, and Steffen 1992) and induction-variable elimination (Cocke and Kennedy 1977) in just a few weeks. Graduate students at Yale and at Portland State have also implemented a variety of optimizations.

Hoopl’s graphs can support optimizations beyond classic dataflow. For example, in GHC, Hoopl’s graphs are used to implement optimizations based on control flow, such as eliminating branch chains. Hoopl is SSA-neutral: although we know of no attempt to use Hoopl to establish or enforce SSA invariants, Hoopl makes it easy to include φ-functions in the representation of first nodes, and if a transformation preserves SSA invariants, it will continue to do so when implemented in Hoopl.

Examining the API. We hope that our presentation of the API in Section 4 speaks for itself, but there are a couple of properties worth highlighting. First, it’s a good sign that the API provides many higher-order combinators that make it easier to write client code. We have had space to mention only a few: extendJoinDomain, joinMaps, thenFwdRw, iterFwdRw, deepFwdRw, and pairFwd.

Second, the static encoding of open and closed shapes at compile time worked out well. Shapes may seem like a small refinement, but they helped eliminate a number of bugs from GHC, and we expect them to help other clients too. GADTs are a convenient way to express shapes, and for clients written in Haskell, they are clearly appropriate. If one wished to port Hoopl to a language without GADTs, many of the benefits could be realized by making the shapes phantom types, but without GADTs, pattern matching would be significantly more tedious and error-prone.

Examining the implementation. If you are thinking of adopting Hoopl, you should consider not only whether you like the API, but whether, if you had to, you could maintain the implementation. We believe that Section 5 sketches enough to show that Hoopl’s implementation is a clear improvement over previous implementations of similar ideas. By decomposing our implementation into node, block, body, graph, cat, fixpoint, and mkFRewrite, we have cleanly separated multiple concerns: interleaving analysis with rewriting, throttling rewriting using optimization fuel, and computing a fixed point using speculative rewriting. Because of this separation of concerns, we believe our implementation will be easier to maintain than anything that preceded it.

Design alternatives. We have explored many alternatives to the API presented above. While these alternatives are interesting, describing and discussing an interesting alternative seems to take us a half-column or a column of text. Accordingly, we discuss only the single most interesting alternative: keeping the rewrite monad m private instead of allowing the client to define it.

We have implemented an alternative API in which every rewrite function must use a monad mandated by Hoopl. This alternative has advantages: Hoopl implements checkpoint, restart, setFuel, and getFuel, so we can ensure that they are right and that the client cannot misuse them. The downside is that the only actions a rewrite function can take are the actions in the monad(s) mandated by Hoopl. These monads must therefore provide extra actions that a client might need, such as supplying fresh labels for new blocks. Worse, Hoopl can’t possibly anticipate every action a client might want to take. What if a client wanted one set of unique names for labels and a different set for registers? What if, in order to judge the effectiveness of an optimization, a client wanted to log how many rewrites take place, or in what functions they take place? Or what if a client wanted to implement Primitive Redex Speculation (Runciman 2010), a code-improving transformation that can create new function definitions? Hoopl’s predefined monads don’t accommodate any of these actions. By permitting the client to define the monad m, we risk the possibility that the client may implement key operations incorrectly, but we also ensure that Hoopl can support these examples, as well as other examples not yet thought of.

Final remarks. Dataflow optimization is usually described as a way to improve imperative programs by mutating control-flow graphs. Such transformations appear very different from the tree rewriting that functional languages are so well known for and which makes Haskell so attractive for writing other parts of compilers. But even though dataflow optimization looks very different from what we are used to, writing a dataflow optimizer in Haskell was a win: we had to make every input and output explicit, and we had a strong incentive to implement things compositionally. Using Haskell helped us make real improvements in the implementation of some very sophisticated ideas.

Acknowledgments

Brian Huffman and Graham Hutton helped with algebraic laws. Sukyoung Ryu told us about Primitive Redex Speculation. Several anonymous reviewers helped improve the presentation.

The first and second authors were funded by a grant from Intel Corporation and by NSF awards CCF-0838899 and CCF-0311482. These authors also thank Microsoft Research Ltd, UK, for funding extended visits to the third author.

References

Andrew W. Appel. 1998. Modern Compiler Implementation. Cambridge University Press, Cambridge, UK. Available in three editions: C, Java, and ML.

John Cocke and Ken Kennedy. 1977. An algorithm for reduction of operator strength. Communications of the ACM, 20(11):850–856.

Keith D. Cooper, Timothy J. Harvey, and Ken Kennedy. 2001. A simple, fast dominance algorithm. Technical report, Rice University. Unpublished report available from http://www.hipersoft.rice.edu/grads/publications/dom14.pdf.

Patrick Cousot and Radhia Cousot. 1977 (January). Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Conference Record of the 4th ACM Symposium on Principles of Programming Languages, pages 238–252.

Patrick Cousot and Radhia Cousot. 1979 (January). Systematic design of program analysis frameworks. In Conference Record of the 6th Annual ACM Symposium on Principles of Programming Languages, pages 269–282.

John B. Kam and Jeffrey D. Ullman. 1976. Global data flow analysis and iterative algorithms. Journal of the ACM, 23(1):158–171.

John B. Kam and Jeffrey D. Ullman. 1977. Monotone data flow analysis frameworks. Acta Informatica, 7:305–317.

Gary A. Kildall. 1973 (October). A unified approach to global program optimization. In Conference Record of the ACM Symposium on Principles of Programming Languages, pages 194–206.

Jens Knoop, Oliver Ruething, and Bernhard Steffen. 1992. Lazy code motion. Proceedings of the ACM SIGPLAN ’92 Conference on Programming Language Design and Implementation, in SIGPLAN Notices, 27(7):224–234.

Sorin Lerner, David Grove, and Craig Chambers. 2002 (January). Composing dataflow analyses and transformations. Conference Record of the 29th Annual ACM Symposium on Principles of Programming Languages, in SIGPLAN Notices, 31(1):270–282.

Thomas J. Marlowe and Barbara G. Ryder. 1990. Properties of data flow frameworks: a unified model. Acta Informatica, 28(2):121–163.

Steven S. Muchnick. 1997. Advanced compiler design and implementation. Morgan Kaufmann, San Mateo, CA.

George C. Necula, Scott McPeak, Shree Prakash Rahul, and Westley Weimer. 2002. CIL: Intermediate language and tools for analysis and transformation of C programs. In CC ’02: Proceedings of the 11th International Conference on Compiler Construction, pages 213–228.

Norman Ramsey and João Dias. 2005 (September). An applicative control-flow graph based on Huet’s zipper. In ACM SIGPLAN Workshop on ML, pages 101–122.

Colin Runciman. 2010 (June). Finding and increasing PRS candidates. Reduceron Memo 50, www.cs.york.ac.uk/fp/reduceron.

David A. Schmidt. 1998. Data flow analysis is model checking of abstract interpretations. In Conference Record of the 25th Annual ACM Symposium on Principles of Programming Languages, pages 38–48.

Bernhard Steffen. 1991. Data flow analysis as model checking. In TACS ’91: Proceedings of the International Conference on Theoretical Aspects of Computer Software, pages 346–365.

Raja Vallée-Rai, Etienne Gagnon, Laurie J. Hendren, Patrick Lam, Patrice Pominville, and Vijay Sundaresan. 2000. Optimizing Java bytecode using the Soot framework: Is it feasible? In CC ’00: Proceedings of the 9th International Conference on Compiler Construction, pages 18–34.

David B. Whalley. 1994 (September). Automatic isolation of compiler errors. ACM Transactions on Programming Languages and Systems, 16(5):1648–1659.
Supercompilation by Evaluation Maximilian Bolingbroke
Simon Peyton Jones
University of Cambridge [email protected]
Microsoft Research [email protected]
Abstract
This paper shows how call-by-need supercompilation can be recast to be based explicitly on an evaluator, contrasting with standard presentations which are specified as algorithms that mix evaluation rules with reductions that are unique to supercompilation. Building on standard operational-semantics technology for call-by-need languages, we show how to extend the supercompilation algorithm to deal with recursive let expressions.

Categories and Subject Descriptors D.3.1 [Programming Languages]: Formal Definitions and Theory – Semantics; D.3.2 [Programming Languages]: Language Classifications – Applicative (functional) languages; D.3.4 [Programming Languages]: Processors – Optimization

General Terms Algorithms, Performance

1. Overview

Supercompilation is a powerful program transformation technique due to Turchin [1] which can be used to both automatically prove theorems about programs [2] and greatly improve the efficiency with which they execute [3]. Supercompilation is capable of achieving transformations such as deforestation [4], function specialisation and constructor specialisation [5]. Despite its remarkable power, the transformation is simple, principled and fully automatic. Supercompilation is closely related to partial evaluation, but can achieve strictly more optimising transformations [6]. The key contributions of this paper are as follows:

• Inspired by Mitchell’s promising results [7], we cast supercompilation in a new light, showing how to design a modular supercompiler that is based directly on the operational semantics of the language (Section 3). Viewing supercompilation in this way is valuable, because it makes it easier to derive a supercompiler in a systematic way from the language, and to adapt it to new language features. Previous work intermingles evaluation and specialisation in a much more complex and ad-hoc way.

• As an example of this flexibility, we show how to supercompile a call-by-need language with unrestricted recursive let bindings, by making use of a standard evaluator for call-by-need (Section 4). This has two advantages. First, because recursion is not special, we do not need to give the program top-level special status, or λ-lift the input program. Second, our supercompiler can deforest the following term:

    let ones = 1 : ones; map = . . .
    in map (λx . x + 1) ones

  into the direct-style definition:

    let xs = 2 : xs in xs

  With the exception of Klyuchnikov’s HOSC [8], previous supercompilers for lazy languages have dealt only with nonrecursive let bindings. HOSC is also able to deforest this example, but at the cost of sometimes duplicating work – something that we are careful to avoid.

• We perform an empirical evaluation of our supercompiler (Section 5), in particular comparing it to Mitchell’s supercompiler [7]. The source code for the implementation, the Cambridge Haskell Supercompiler (CHSC), is available online¹. Our supercompiler reduces benchmark runtime by up to 95%, with a mean reduction of 26%.

¹ http://github.com/batterseapower/supercompilation-by-evaluation/

2. Supercompilation by example

The best way to understand how supercompilation works is by example. Let’s begin with a simple example of how standard supercompilation can specialise functions to their higher-order arguments:

  let inc = λx . x + 1
      map = λf xs. case xs of [ ]      → [ ]
                              (y : ys) → f y : map f ys
  in map inc zs

A supercompiler evaluates open terms, so that reductions that would otherwise be done at runtime are performed at compile time. Consequently, the first step of the algorithm is to reduce the term as much as possible, following standard evaluation rules:

  let inc = . . . ; map = . . .
  in case zs of [ ]      → [ ]
                (y : ys) → inc y : map inc ys

At this point, we become stuck on the free variable zs. The most important decision when designing a supercompiler is how to proceed in such a situation, and we will spend considerable time later explaining how this choice is made when we cover the splitter in Section 3.5. In this particular example, we continue by recursively supercompiling two subexpressions. We intend to later recombine the two subexpressions into an output term where the case zs remains in the output program, but where both branches of the case have been further optimised by supercompilation.
The first subexpression is just [ ]. Because this is already a value, supercompilation makes no progress: the result of supercompiling that term is therefore [ ]. The second subexpression is:

  let inc = . . . ; map = . . .
  in inc y : map inc ys

Again, evaluation of this term is unable to make progress: the rules of call-by-need reduction do not make allowance for evaluating within non-strict contexts such as the arguments of data constructors. It is once again time to use the splitter to produce some subexpressions suitable for further supercompilation. This time, the first subexpression is:

  let inc = . . . in inc y

Again, we perform reduction, yielding the supercompiled term y + 1. The other subexpression, originating from splitting the (y : ys) case branch, is:

  let inc = . . . ; map = . . . in map inc ys

This term is identical to the one we started with, except that it has the free variable ys rather than zs. If we continued inlining and β-reducing the map call, the supercompiler would not terminate. This is not what we do. Instead, the supercompiler uses a memo function. It records all of the terms it has been asked to supercompile as it proceeds, so that it never supercompiles the same term twice. In concrete terms, it builds up a set of promises, each of which is an association between a term previously submitted for supercompilation, its free variables, and a unique, fresh name (typically written h0, h1, etc.). At this point in the supercompilation of our example, the promises will look something like this:

  h0 zs   ↦ let inc = . . . ; map = . . . in map inc zs
  h1      ↦ [ ]
  h2 y ys ↦ let inc = . . . ; map = . . . in inc y : map inc ys
  h3 y    ↦ let inc = . . . in inc y

We have presented the promises in a rather suggestive manner, as if the promises were a sequence of bindings. Indeed, the intention is that the final output of the supercompilation process will be not only an optimised expression, but one optimised binding for each h0, h1, ... ever added to the promises. Because the term we are now being asked to supercompile is simply a renaming of the original term (with which we associated the name h0) we can immediately return h0 ys as the supercompiled version of the current term. Producing a tieback like this, we can rely on the (not yet known) optimised form of the original term (rather than supercompiling afresh), while simultaneously sidestepping a possible source of non-termination.

Now, both of the recursive supercompilations requested in the process of supercompiling h2 have been completed. We can now rebuild the optimised version of the h2 term from the optimised subterms, which yields:

  h3 y : h0 ys

Continuing this process of rebuilding an optimised version of the supercompiler input from the optimised subexpressions, we eventually obtain this final program:

  let h0 zs = case zs of [ ] → h1 ; (y : ys) → h2 y ys
      h1 = [ ]
      h2 y ys = h3 y : h0 ys
      h3 y = y + 1
  in h0 zs

A trivial post-pass can eliminate some of the unnecessary indirections to obtain a version of the original input expression, where map has been specialised on its functional argument:

  let h0 zs = case zs of [ ] → [ ]; (y : ys) → (y + 1) : h0 ys
  in h0 zs

  Variables          x, y, z
  Primitives         ⊗ ::= +, −, . . .
  Data Constructors  C ::= True, Just, (:), . . .
  Literals           ℓ ::= 1, 2, . . . , ’a’, ’b’, . . .

  Values v ::= λx. e         Lambda abstraction
             | ℓ             Literal
             | C x           Saturated constructed data

  Terms e ::= x                   Variable reference
            | v                   Values
            | e x                 Application
            | e ⊗ e               Binary primops
            | let x = e in e      Recursive let-binding
            | case e of α → e     Case decomposition

  Case Alternative α ::= ℓ       Literal alternative
                       | C x     Constructor alternative

  Heaps         H ::= x ↦ e

  Stack Frames  κ ::= update x          Update frame
                    | • x               Apply to function value
                    | case • of α → e   Scrutinise value
                    | • ⊗ e             Apply first value to primop
                    | v ⊗ •             Apply second value to primop

  Stacks        K ::= κ

  Figure 1: Syntax of the Core language and evaluator

3. The basic supercompiler

We now describe the design of an unusually-modular supercompiler for a simple functional language that closely approximates GHC’s intermediate language, Core. The syntax of the language itself is presented in Figure 1; it is a standard untyped call-by-need calculus with recursive let, algebraic data types, primitive literals and strict primitive operations. Although Figure 1 describes terms in A-normal form [9], for clarity of presentation we will often write non-normalised expressions. A program is simply a term, in which the top-level function definitions appear as possibly-recursive let bindings.

A small-step operational semantics of Core appears in Figure 3, and is completely conventional in the style of Sestoft [10] — so conventional that our description here is very brief indeed. The state of the machine is a triple ⟨H, e, K⟩ of a heap, a term and a stack. The term is the focus of evaluation, while the stack embodies the evaluation context, or continuation, that will consume the value produced by the term. Figure 1 gives the syntax of heaps and stacks, as well as terms.

Our supercompiler is built from the following four, mostly independent, subsystems:
1. A termination criterion that prevents the supercompiler from running forever: Section 3.2

2. An evaluator for the language under consideration: Section 3.3

3. A memoiser, which ensures that we supercompile any term at most once: Section 3.4

4. A splitter that tells us how to proceed when evaluation becomes blocked: Section 3.5

We will show how to implement each of these components in a way that will yield a standard supercompiler, which is nonetheless more powerful than previous work in that it will naturally support recursive let.

3.1 The top-level

A distinctive feature of our supercompiler is that it operates on States rather than Terms; we reflect on why in Section 3.7. A State is a triple of type (Heap, Term, Stack), and it represents precisely the state ⟨H, e, K⟩ of the abstract machine (Figure 3). Notice that Term and State are related: any Term can be converted to its initial State, and any State can be converted back to a Term simply by wrapping the heap and the stack around the term, a function we call rebuild. The signatures of the major functions and data types used by the supercompiler – including State and rebuild – are given for easy reference in Figure 2.

  type Heap = Map Var Term          -- See H in Figure 1

  type Stack = [StackFrame ]        -- See κ in Figure 1
  data StackFrame = . . .

  data Term = . . .                 -- See e in Figure 1

  type State = (Heap, Term, Stack )

  freeVars :: State → [Var ]
  rebuild  :: State → Term

  sc :: History → State → ScpM Term

  -- The evaluator (Section 3.3)
  reduce :: State → State

  -- The splitter (Section 3.5)
  split :: Monad m ⇒ (State → m Term) → State → m Term

  -- Termination checking (Section 3.2)
  type History = [State ]
  emptyHistory = [ ] :: History
  data TermRes = Stop | Continue History
  terminate :: History → State → TermRes

  -- Memoisation and the ScpM monad (Section 3.4)
  memo  :: (State → ScpM Term) → State → ScpM Term
  match :: State → State → Maybe (Var → Var )
  runScpM   :: ScpM Term → Term
  freshName :: ScpM Var
  bind      :: Var → Term → ScpM ()
  promises  :: ScpM [Promise ]
  promise   :: Promise → ScpM ()
  data Promise = P {name :: Var , fvs :: [Var ], meaning :: State }

  Figure 2: Types used in the standard supercompiler

The core of the supercompilation algorithm is sc, whose key property is this: for any history h and state s, (sc h s) returns a term with exactly the same meaning as s, but which is implemented more efficiently.

  sc, sc’ :: History → State → ScpM Term
  sc hist = memo (sc’ hist)
  sc’ hist state = case terminate hist state of
                     Continue hist’ → split (sc hist’) (reduce state)
                     Stop           → split (sc hist) state

As foreshadowed in Section 2, sc is a memoised function: if it is ever asked to supercompile a State that is identical to one we have previously supercompiled (modulo renaming), we want to reuse that previous work. This is achieved by calling memo, which memoises uses of sc by recording information in the ScpM monad. We will describe memoisation in more detail in Section 3.4.
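Just to fix intuitions, here is a hedged sketch of how memo might be built from the Figure 2 primitives; the helpers apps and lambdas (for applying a variable to, and abstracting over, a list of free variables) are invented for illustration, and the paper’s own definition appears in Section 3.4:

  memo :: (State → ScpM Term) → State → ScpM Term
  memo opt state = do
    ps <- promises
    case [ (p, rn) | p <- ps, Just rn <- [match (meaning p) state ] ] of
      (p, rn) : _ →                      -- a renaming of an old term: tie back
        return (name p ‘apps‘ map rn (fvs p))
      [ ] → do                           -- a genuinely new term: promise, then optimise
        h <- freshName
        promise (P {name = h, fvs = freeVars state, meaning = state })
        e <- opt state
        bind h (lambdas (freeVars state) e)
        return (h ‘apps‘ freeVars state)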
-- The evaluator (Section 3.3) reduce :: State → State -- The splitter (Section 3.5) split :: Monad m ⇒ (State → m Term) → State → m Term -- Termination checking (Section 3.2) type History = [State ] emptyHistory = [ ] :: History data TermRes = Stop | Continue History terminate :: History → State → TermRes -- Memoisation and the ScpM monad (Section 3.4) memo :: (State → ScpM Term) → State → ScpM Term match :: State → State → Maybe (Var → Var ) runScpM :: ScpM Term → Term freshName :: ScpM Var bind :: Var → Term → ScpM () promises :: ScpM [Promise ] promise :: Promise → ScpM () data Promise = P {name :: Var , fvs :: [Var ], meaning :: State }
1. It invokes a call-by-need evaluator, reduce, to optimise the state s by evaluating it to head normal form. This amounts to performing compile-time evaluation, so reduce must itself be careful not to diverge – see Section 3.3. 2. It uses split to recursively supercompile some subcomponents of the reduced state, optimising parts of the term that reduction didn’t reach.
Figure 2: Types used in the standard supercompiler
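Although Figure 2 gives only rebuild's signature, its definition is short. The following is a minimal sketch under assumed constructor names (Let, App, Case and Var for Term; Apply, Scrutinise and Update for StackFrame), since the paper elides both data types; primitive-operation frames would be handled analogously:

  import qualified Data.Map as Map

  rebuild :: State -> Term
  rebuild (h, e, k) = wrapHeap h (foldl wrapFrame e k)
    where
      -- wrap the heap around the result as a (possibly recursive) let
      wrapHeap heap body
        | Map.null heap = body
        | otherwise     = Let (Map.toList heap) body
      -- each stack frame reinstates the syntactic context it came from
      wrapFrame e' (Apply x)       = App e' x
      wrapFrame e' (Scrutinise as) = Case e' as
      wrapFrame e' (Update x)      = Let [(x, e')] (Var x)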
The core of the supercompilation algorithm is sc, whose key property is this: for any history h and state s, (sc h s) returns a term with exactly the same meaning as s, but which is implemented more efficiently.

sc, sc′ :: History → State → ScpM Term
sc hist = memo (sc′ hist)
sc′ hist state = case terminate hist state of
    Continue hist′ → split (sc hist′) (reduce state)
    Stop           → split (sc hist) state

As foreshadowed in Section 2, sc is a memoised function: if it is ever asked to supercompile a State that is identical to one we have previously supercompiled (modulo renaming), we want to reuse that previous work. This is achieved by calling memo, which memoises uses of sc by recording information in the ScpM monad. We will describe memoisation in more detail in Section 3.4.

Memoisation deals with the case where sc is called on an identical argument. But what if it is called on a growing argument? You might imagine that we would keep supercompiling forever. This well-known problem arises, for example, when supercompiling a recursive function with an accumulating parameter. There is likewise a well-known way to ensure that supercompilation terminates, which involves maintaining a "history" of previous arguments. In concrete terms, the parameter hist is the history, and sc′ starts by calling terminate (Figure 2) to decide whether to Stop or (the common case) Continue. The implementation of histories and terminate is elaborated in Section 3.6. The normal case is that terminate returns Continue hist′, in which case sc′ proceeds thus:

1. It invokes a call-by-need evaluator, reduce, to optimise the state s by evaluating it to head normal form. This amounts to performing compile-time evaluation, so reduce must itself be careful not to diverge – see Section 3.3.
2. It uses split to recursively supercompile some subcomponents of the reduced state, optimising parts of the term that reduction didn't reach.

Here is an example. Imagine that this term was input to sc²:

let x = True; y = 1 + 2 in case x of True → Just y; False → Nothing

Assuming that this State has never been previously supercompiled, sc′ will be invoked by memo. Further assuming that the termination check in sc′ returns Continue, we would reduce the input state to head normal form, giving a new state′:

let y = 1 + 2 in Just y

The case computation and x binding have been reduced away. It would be possible to return this state′ as the final, supercompiled form of our input – indeed, in general the supercompiler is free to stop at any time, using rebuild to construct a semantically-equivalent result term. However, doing so misses the opportunity to supercompile some subcomponents of state′ that are not reduced in the head normal form. Instead, we feed state′ to split, which:

1. Invokes sc hist′ on the subterm 1 + 2, achieving further supercompilation (and hence optimisation). Let's say for the purposes of the example that this then returns the final optimised term h1, with a corresponding optimised binding h1 = 3 recorded in the monad.

² Technically, sc takes a State not a Term, but for ease of presentation our examples will often use a term e in place of the state (emptyHeap, e, [ ]), as we do here.
2. Reconstructs the term using the optimised subexpressions. So in this case the Term returned by split would be let y = h1 in Just y.

The entry point to the supercompiler, start, is as follows:

start :: Term → Term
start e = runScpM (sc emptyHistory (emptyHeap, e, [ ]))

The input term, e, is first converted into an initial State, namely (emptyHeap, e, [ ]). This initial state is passed to the main supercompiler sc, along with the initial history. Finally sc is performed in the ScpM monad, initialised by runScpM – we describe this monad in detail in Section 3.4.

In the following sections, we will explore the meaning and implementation of the reduce, memo, terminate and split functions in much more detail.

3.2 The termination criterion

The core of the supercompiler's termination check is provided by a single function, terminate:

terminate :: History → State → TermRes
data TermRes = Stop | Continue History

As the supercompiler proceeds, it builds up an ever-larger History of previously-observed States. This history is both interrogated and extended by calling terminate. Termination is guaranteed by making sure that History cannot grow indefinitely. More precisely, terminate guarantees that, for any history h0 and states s0, s1, s2, ... there can be no infinite sequence of calls to terminate of this form:

terminate h0 s0 = Continue h1
terminate h1 s1 = Continue h2
...

Instead, there will always exist some j such that:

terminate hj sj = Stop

In Section 3.3 we will see how reduce uses terminate to ensure that it only performs a bounded number of reduction steps, and we will discuss how terminate ensures that the overall supercompiler terminates in Section 3.6.

So much for the specification, but how can terminate be implemented? Of course, (λx y. Stop) would be a sound implementation of terminate, in that it satisfies the property described above, but it is wildly over-conservative because it forces the supercompiler to stop reduction immediately. We want an implementation of terminate that is correct, but which nonetheless waits for as long as possible before preventing further reduction by answering Stop.

The key to implementing such a termination criterion is defining a well-quasi-order [11, 12]. The relation ◁ ∈ S × S is a well-quasi-order iff for all infinite sequences of elements of S (s0, s1, ...), it holds that: ∃i j. i < j ∧ si ◁ sj. Given any well-quasi-order ◁ : State × State, we can implement a correct terminate function:

terminate prevs here = if any (◁ here) prevs then Stop else Continue (here : prevs)

Concretely, we choose the tag-bag ordering of Mitchell [7] as the basis of our well-quasi-order. The tag-bag order relates bags (multisets) of "tags" as follows:

t1 ◁tb t2 ⟺ set(t1) = set(t2) ∧ |t1| ≤ |t2|

For this to be a well-quasi-order there must be a finite number of distinct tags that can appear in the bags. We take tags to be Ints, and assume that every sub-term of the supercompiler's input program is labelled with a unique Int, which forms the tag for that expression. Likewise, StackFrames are labelled with the tag of the term the evaluator produced them from – e.g. a case • of α → e frame would be labelled with the tag of the corresponding case expression. Occasionally, the evaluator needs to manufacture a new term which did not necessarily occur in the input program – e.g. if we evaluate 1 + 2 to get the new value 3. In such cases, one of the operand tags is used as the tag for the new term.

The termination criterion then defines an internal function that obtains a tag-bag from the components of a State triple:

tagBag :: State → Bag Tag
tagBag (h, e, k) = (termTag e ∗ 2) ‘insertBag‘ (fmap (∗3) (heapTagBag h) ‘plusBag‘ fmap (∗5) (stackTagBag k))

The tagBag function multiplies tags by distinct prime numbers depending on where in the evaluation context the tag originated from. This does not change the fact that there are only ever a finite number of distinct tags in the bags (and hence ◁tb is still a well-quasi-order). However, the multiplication tends to prevent the evaluator from terminating just because e.g. a tagged binding that used to appear in the Heap is forced and hence has its tag show up on a StackFrame instead.

Finally, we can combine tagBag and ◁tb to produce the well-quasi-order ◁ on States used by terminate:

(◁) :: State → State → Bool
s1 ◁ s2 = tagBag s1 ◁tb tagBag s2

Mitchell uses tag-bags in a similar way, but only associates tags with let-bound variables. In order to tag every subexpression, he keeps terms in a normal form where all subexpressions are let-bound. Supercompiling States and tagging subterms directly means that we can avoid let-floating and – because we distinguish between tags from subexpressions currently being evaluated (in the stack), and those subexpressions that are not in the process of being forced (in the heap) – our termination criterion is more lenient.
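To make the tag-bag order itself concrete, the following sketch models a bag as a map from elements to multiplicities; the paper never fixes a Bag representation, so this encoding is an assumption:

  import qualified Data.Map.Strict as Map

  type Tag = Int
  type Bag a = Map.Map a Int   -- multiset: element mapped to its multiplicity

  bagSize :: Bag a -> Int
  bagSize = sum . Map.elems

  -- The tag-bag order: same underlying set of tags, and the
  -- first bag is no larger than the second.
  leqTB :: Bag Tag -> Bag Tag -> Bool
  leqTB t1 t2 = Map.keysSet t1 == Map.keysSet t2
             && bagSize t1 <= bagSize t2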
3.3 The evaluator

The reduce function tries to reduce a State to head normal form. In case the term diverges, reduce includes a termination check that allows it to stop after a finite number of steps. (This check is conservative, of course, so reduce might fail to find a head normal form when one does exist.) The two key properties of reduce are:

• Reduction preserves meaning: the State returned has the same semantics as the input State
• Regardless of what meaning the input State may have, reduce always terminates

The implementation is straightforward:

reduce :: State → State
reduce = go emptyHistory
  where
    go hist state = case step state of
        Nothing → state
        Just state′
          | intermediate state′ → go hist state′
          | otherwise → case terminate hist state′ of
              Stop           → state′
              Continue hist′ → go hist′ state′

    intermediate (_, Var _, _) = False
    intermediate _             = True

    step :: State → Maybe State  -- Implements Figure 3
VAR         ⟨H, x ↦ e | x | K⟩                                → ⟨H | e | update x, K⟩
UPDATE      ⟨H | v | update x, K⟩                             → ⟨H, x ↦ v | v | K⟩
APP         ⟨H | e x | K⟩                                     → ⟨H | e | • x, K⟩
LAMBDA      ⟨H | λx. e | • x, K⟩                              → ⟨H | e | K⟩
PRIM        ⟨H | e1 ⊗ e2 | K⟩                                 → ⟨H | e1 | • ⊗ e2, K⟩
PRIM-LEFT   ⟨H | v1 | • ⊗ e2, K⟩                              → ⟨H | e2 | v1 ⊗ •, K⟩
PRIM-RIGHT  ⟨H | v2 | v1 ⊗ •, K⟩                              → ⟨H | ⊗(v1, v2) | K⟩
CASE        ⟨H | case escrut of α → e | K⟩                    → ⟨H | escrut | case • of α → e, K⟩
DATA        ⟨H | C x | case • of {..., C x → e, ...}, K⟩      → ⟨H | e | K⟩
LIT         ⟨H | ℓ | case • of {..., ℓ → e, ...}, K⟩          → ⟨H | e | K⟩
LETREC      ⟨H | let x = e in ebody | K⟩                      → ⟨H, x ↦ e | ebody | K⟩

Figure 3: Operational semantics of the Core language
The reduce function uses a loop, the function go, with an accumulating history. In turn go uses an internal function, step, which implements precisely the one-step reduction relation of Figure 3. Note that step returns a Maybe State – this accounts for reduction being unable to proceed, due either to reaching a value, or to a variable being in the focus which is not bound by the heap (remember that reduce may be used on open terms). In that case reduce terminates with the state it has reached. The totality of reduce is achieved using the terminate function. If terminate reports that evaluation appears to be diverging, reduce immediately returns. As a result, the State triple (h, e, k) returned by reduce might not be fully reduced – in particular, it might be the case that e ≡ Var x where x is bound by h.

As an optimisation, the termination criterion is not tested if the State is considered to be "intermediate". The intermediate predicate shown ensures that we only test for non-termination upon reaching a variable. This is safe because every infinite series of reduction steps must certainly have a variable occur in the focus an infinite number of times (it is straightforward to construct a measure on (e, K) pairs that is strictly decreased by every reduction rule except VAR).

After some experience with our supercompiler we discovered that making termination tests infrequent is actually more than a mere optimisation. If we test for termination very frequently (say, after every tiny step), the successive states will be very similar; and the more similar they are, the greater the danger that the necessarily-conservative termination criterion (Section 3.2) will unnecessarily say Stop. (For example, in the limit, it must say Stop for two identical states.)
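For reference, the first few rules of Figure 3 transcribe almost mechanically into step. The fragment below is only a sketch under assumed names: the concrete Term and StackFrame constructors (Var, App, Apply, Update) and the isValue test are our inventions, since the paper elides those definitions:

  step :: State -> Maybe State
  step (h, Var x, k) = case Map.lookup x h of
      Nothing -> Nothing                                  -- free variable: stuck
      Just e  -> Just (Map.delete x h, e, Update x : k)   -- VAR
  step (h, v, Update x : k)
      | isValue v = Just (Map.insert x v h, v, k)         -- UPDATE
  step (h, App e x, k) = Just (h, e, Apply x : k)         -- APP
  -- ... the remaining rules of Figure 3 follow the same pattern ...
  step _ = Nothing                                        -- no rule applies: stuck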
3.4 The memoiser

The purpose of the memoisation function, memo, is to ensure that we never supercompile a term more than once. We achieve this by using the ScpM monad to record information about previously supercompiled States. Precisely, the ScpM monad is a simple state monad with three pieces of state:

1. The promises, which comprise all the States that have been previously submitted for supercompilation, along with:
   • The names that the supercompiled versions of those States will be bound to in the final program (e.g. h0, h1)
   • The list of free variables that those bindings will be abstracted over³. By instantiating these free variables several different ways, we can reuse the supercompiled version of a State several times.
   The data structure used to store all this information is called a Promise (Figure 2).
2. The optimised bindings, each of the form x = e. The runScpM function, which is used to actually execute ScpM Term computations, wraps the optimised bindings collected during the supercompilation process around the final supercompiled Term in order to produce the final output.
3. A supply of fresh names (h0, h1, ...) to use for the optimised bindings.

(³ Strictly speaking, bindings with no free variables at all should nonetheless be λ-abstracted over a dummy argument (such as ()). This will prevent us from accidentally introducing space leaks by increasing the garbage-collection lifetime of constant expressions.)

When sc begins to supercompile a State, it records a promise for that state; when it finishes supercompiling that state it records a corresponding optimised binding for it. At any moment there may be unfulfilled promises that lack a corresponding binding, but every binding has a corresponding promise. Moreover, every promise will eventually be fulfilled by an entry appearing in the optimised bindings. Figure 2 summarises the signatures of the functions provided by ScpM.

We can now implement memo as follows:

memo :: (State → ScpM Term) → State → ScpM Term
memo opt state = do
    ps ← promises
    let ress = [ name p ‘apps‘ map rn (fvs p)
               | p ← ps
               , Just rn ← [match (meaning p) state ] ]
    case ress of
      res : _ → return res
      [ ]     → do
        x ← freshName
        let vs = freeVars state
        promise P {name = x, fvs = vs, meaning = state }
        e ← opt state
        bind x (lambdas vs e)
        return (x ‘apps‘ vs)
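Figure 2 fixes only ScpM's interface, not its representation. One plausible realisation is an ordinary state monad over the three pieces of state just described; the following sketch assumes the Promise and Term types of Figures 1 and 2 and takes Var to be String:

  import Control.Monad.State

  type Var = String

  data ScpState = ScpState
    { scpPromises :: [Promise]       -- states submitted so far, newest first
    , scpBindings :: [(Var, Term)]   -- finished optimised bindings
    , scpFresh    :: Int             -- index of the next fresh name
    }

  type ScpM = State ScpState

  freshName :: ScpM Var
  freshName = do
    st <- get
    put st { scpFresh = scpFresh st + 1 }
    return ("h" ++ show (scpFresh st))

  promise :: Promise -> ScpM ()
  promise p = modify (\st -> st { scpPromises = p : scpPromises st })

  bind :: Var -> Term -> ScpM ()
  bind x e = modify (\st -> st { scpBindings = (x, e) : scpBindings st })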
The memo function proceeds as follows:

1. Firstly, it examines all existing promises. If the match function reports that some existing promise matches the State we want to supercompile (up to renaming), memo returns a call to the optimised binding corresponding to that existing promise.
2. Assuming no promise matches, memo continues:
   (a) A new promise for this novel State is made, in the form of a new Promise entry. A fresh name of the form hn (for some n) is associated with the Promise.
   (b) The state is optimised by calling opt, obtaining an optimised term e.
   (c) A final optimised binding hn = λfvs(s). e is recorded using bind. This binding will be placed in the output program by runScpM.
   (d) Finally, a call to that binding, hn fvs(s), is returned.

The match function is used to compare States:

match :: State → State → Maybe (Var → Var )

The key properties of the match function are that:

• If match s1 s2 ≡ Just rn then the meaning of s2 is the same as that of rn(s1).
• If s1 is syntactically identical to s2, modulo renaming, then isJust (match s1 s2). This property is necessary for termination of the supercompiler, as we will discuss later.

Naturally, it is desirable for the match function to match as many truly equivalent terms as possible. This is made slightly more convenient by the fact that we consider matching States, as they may have already been weakly normalised by the evaluator. Our implementation exploits this by providing a match function that is insensitive to the exact order of bindings in the Heap.

One subtle point is that the matching should be careful not to duplicate work. This can happen if an old term such as:

let x = fact 100; y = fact 100 in (x, y)

is matched against a proposed new one such as:

let x = fact 100 in (x, x)

Tying the new term back to the old binding would make it compute fact 100 twice, so the match must be rejected. However, if the let-bindings in those terms had bound, say, True instead of fact 100 then matching them would be both permissible and desirable.

Unlike some supercompilers (e.g. [8]), our use of the memo function means that we will even share the work of supercompiling nodes that are siblings in the supercompilation "process tree".

3.5 The splitter

The job of the splitter is to somehow continue the process of supercompiling a State which we may not reduce further, either because of a lack of information (e.g. if the State is blocked on a free variable), or because the termination criterion is preventing us from making any further one-step reductions. The splitter has the following type signature:

split :: Monad m ⇒ (State → m Term) → State → m Term

In general, (split opt s) identifies some sub-components of the state s, uses opt to optimise them, and combines the results into a term whose meaning is the same as s (assuming, of course, that opt preserves meaning). A sound, but feeble, implementation of split opt s would be one which never recursively invokes opt:

split _ s = return (rebuild s)

Such an implementation is wildly conservative, because not even trivially reducible subexpressions will benefit from supercompilation. A good split function will residualise as little of the input as possible, using opt to optimise as much as possible. It turns out that, starting from this sound-but-feeble baseline, there is a rich variety of choices one can make for split, as we explore in the rest of this section.

In preparation for describing split in more detail, we first introduce a notational device similar to that of Mitchell [7] for describing the operation of split on particular examples. Suppose that the following State is given to split:

⟨x ↦ 1, xs ↦ map (const 1) ys | x : xs⟩

In our notation the output of split would be this "term", which has sub-components that are States:

let x = ⟨1⟩; xs = ⟨map (const 1) ys⟩ in x : xs

You should read this in the following way:

• The part of the term outside the ⟨state brackets⟩ is the residual code that will form part of the output program.
• In contrast, those things that live within the brackets are the not-yet-residual States which are fed to opt for further supercompilation.

Before split returns, the supercompiled form of the bracketed expressions is pasted into the correct position in the residual code. So the actual end result of such a supercompilation run might be something like:

let x = h2; xs = h3 ys in x : xs

where h2 and h3 will have optimised bindings in the output program, as usual.

So far, we have only seen examples where split opt invokes opt on subterms of the original input. While this is a good approximation to what split does, in general, we will also want to include some of the context in which that subterm lives. Consider the following input:

⟨x ↦ 1, y ↦ x + x | Just y⟩

A good way to split is as follows:

let y = ⟨x ↦ 1 | x + x⟩ in Just y

Note that split opt decided to recursively optimise the term x + x, along with a heap binding for x taken from the context which the subterm lived in. This extra context will allow the supercompiler to reduce x + x to 2 at compile time.

Another way that a subterm can get some context added to it by split is when evaluation of a case expression gets stuck. As an example, consider the following (stuck) input to split:

⟨ | x | case • of (True → 1; False → 2), • + 3⟩

One possibility is that split could break the expression up for further supercompilation as follows:

(case x of True → ⟨1⟩; False → ⟨2⟩) + ⟨3⟩

However, split can achieve rather more potential for reduction if it duplicates the stack frame performing addition into both case branches: in particular, that will mean that we are able to evaluate the addition at compile time:

case x of True → ⟨1 | • + 3⟩; False → ⟨2 | • + 3⟩
In fact, in general we will always want to push all of the stack frames following a case • of α → e frame to meet with the expressions e in the case branches. This is one of the places where the decision to have the supercompiler work with States rather than Terms pays off: the fact that we have an explicit evaluation context makes the process of splitting at a residual case very systematic and easy to implement.

The key property of split is that for any opt that is meaning preserving (such that opt s returns an expression e with the same meaning as s), split opt must be meaning preserving in the same sense. There are a number of subtle points to bear in mind when implementing split. We describe some issues below, and will have more to say in Section 4.

Issue 1: learning from residual case branches. We gain information about a free variable when it is scrutinised by a residual case. Thus, given this State:

⟨ | x | case • of (3 → x + x; 4 → x ∗ x)⟩

We split as follows:

case x of 3 → ⟨x ↦ 3 | x + x⟩; 4 → ⟨x ↦ 4 | x ∗ x⟩

Because we have learnt the value of x from the case alternative, we are able to statically reduce the + and ∗ operations in each branch.

Issue 2: work duplication. Consider splitting the following State, where fact is an unknown function and hence must be assumed to be expensive to execute:

⟨x ↦ fact n | (x + 1, x + 2)⟩

One possibility is to split as follows:

(⟨x ↦ fact n | x + 1⟩, ⟨x ↦ fact n | x + 2⟩)

Unfortunately, this choice leads to duplication of the expensive fact n subterm. If we freely duplicate unbounded amounts of work in this manner we can easily end up "optimising" the program into a much less efficient version. Work can be duplicated even if no syntactic duplication occurs, as in this example:

⟨x ↦ fact n | λy. x + y⟩

We would duplicate work if we were to split in the following way:

λy → ⟨x ↦ fact n | x + y⟩

Furthermore, syntactic duplication does not necessarily lead to work duplication. Consider:

⟨x ↦ fact n | y | case • of (True → x + 1; False → x + 2)⟩

Notice that splitting it as follows does not duplicate the computation of fact n:

case y of True → ⟨x ↦ fact n | x + 1⟩; False → ⟨x ↦ fact n | x + 2⟩

Consequently, we push the heap bindings supplied to split down into those split-out subterms of which they are free variables, as long as either one of these conditions is met:

• The binding manifestly binds a value, such as λx. x: values require no further reduction, so no work can be lost that way
• Pushing the binding down into the subterm would not result in the allocation of its thunk occurring more than once in any possible context consuming the output

Our split uses let-floating to make more heap bindings suitable for pushing down under these criteria. For example, this state:

⟨x ↦ Just (fact n) | λm. case x of Just y → y + m⟩

will be split as follows:

let a = ⟨fact n⟩ in λm. ⟨x ↦ Just a | case x of Just y → y + m⟩

Sketching split. Due to space limitations, we are unable to give a complete description of split. However, we can give a sketch of a suboptimal implementation that may nonetheless clarify our description. We first introduce the concept of a Bracket. This is a Haskell representation of the "term with holes" notational device we introduced earlier. Each hole contains a State:

data Bracket = B {holes :: [State ], build :: [Term ] → Term }

termBracket :: Term → Bracket
termBracket e = B [(emptyHeap, e, noStack)] (λ[e′ ] → e′)

Our code examples will often make use of a [[bracketed]] syntax to concisely define a value of type Bracket:

[[f ⟨1⟩]] :: Bracket

This particular example corresponds to:

B {holes = [(emptyHeap, 1, noStack)], build = λ[e′ ] → var "f" ‘apps‘ e′ }

Split can now be defined as follows:

split opt (h, e, k) = liftM (build br) $ mapM opt (holes br)
  where
    xs = case e of Var x → [x ]; _ → [ ]
    br = splitHeap h $ splitStack xs k $ splitTerm e

Each part of the State is split independently to produce a Bracket, which then has all of its holes optimised before we rebuild the final term. Before we cover splitTerm, splitStack and splitHeap, we will need a way to build a larger bracket from smaller ones:

plusBrackets :: [Bracket ] → ([Term ] → Term) → Bracket
plusBrackets brs rb = B {holes = concatMap holes brs, build = f }
  where
    f es = rb (zipWith (λbr es′ → build br es′) brs ess)
      where ess = splitInto (map holes brs) es

splitInto :: [[b ]] → [a ] → [[a ]]
-- splitInto bss as ≡ ass ∧ length (concat bss) ≡ length as
--   =⇒ map length bss ≡ map length ass ∧ as ≡ concat ass
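splitInto is left unimplemented in the paper, with only the specification above; one implementation consistent with that specification is:

  -- Partition as into consecutive chunks whose lengths mirror those of bss.
  splitInto :: [[b]] -> [a] -> [[a]]
  splitInto []         _  = []
  splitInto (bs : bss) as = chunk : splitInto bss rest
    where (chunk, rest) = splitAt (length bs) as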
Now, splitTerm just identifies some subexpressions for supercompilation:

splitTerm :: Term → Bracket
splitTerm e = plusBrackets (map termBracket es) rb
  where (es, rb) = uniplate e

We make use of the uniplate combinator (following Mitchell and Runciman [13]), which takes a Term apart into a list of its immediate subterms, and a function to recombine those subterms to obtain the original input:

uniplate :: Term → ([Term ], [Term ] → Term)
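To illustrate the combinator's contract, here is what uniplate might look like for a cut-down Term type; the constructors used (App, PrimOp, Lam) are assumptions standing in for the elided Term definition:

  uniplate :: Term -> ([Term], [Term] -> Term)
  uniplate (App e x)     = ([e], \[e'] -> App e' x)
  uniplate (PrimOp p es) = (es,  \es'  -> PrimOp p es')
  uniplate (Lam x e)     = ([e], \[e'] -> Lam x e')
  uniplate e             = ([],  \[]   -> e)   -- leaves: variables, literals, ...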
There is more work to do when splitting the stack:

splitStack :: [Var ] → Stack → Bracket → ([(Var , Bracket)], Bracket)
The call splitStack xs k b splits stack k with bracket b in the focus, where all of the variables xs are guaranteed to have the same value as the focus. We will use the xs in splitStack to learn from residual case branches. There are three principal possibilities that splitStack has to deal with. Firstly, applications and primitives can be handled uniformly:

splitStack xs (• x : k) br = splitStack [ ] k [[⟨br⟩ x ]]
splitStack xs (• ⊗ e : k) br = splitStack [ ] k [[⟨br⟩ ⊗ ⟨e⟩]]
splitStack xs (v ⊗ • : k) br = splitStack [ ] k [[⟨v⟩ ⊗ ⟨br⟩]]

The next possibility is that the stack frame arises from a case:

splitStack xs (case • of α → e : k) br = ([ ], [[case ⟨br⟩ of α → ⟨altbr⟩]])
  where
    altbr = ⟨altHeap α | e | k⟩
    altHeap α = fromList [(x, altConValue α) | x ← xs ]

altConValue :: AltCon → Value
altConValue (C x) = C x
altConValue ℓ     = ℓ

Notice that we do not recursively call splitStack in this situation: as we discussed, the entire stack is pushed into each case branch. We also use altHeap to construct a heap that binds the variables being scrutinised (if any) to the value corresponding to the particular case alternative.

Finally, the immediate stack frame may be an update frame:

splitStack xs (update x : k) br = ((x, br) : xbrs′, br′)
  where (xbrs′, br′) = splitStack (x : xs) k [[x ]]

In this case, we recursively split the remainder of the stack, but change the focus to be the variable being updated. The presence of update frames is why splitStack returns a [(Var, Bracket)] as well as a Bracket – the list of (Var, Bracket) contains a Bracket for every update frame that splitStack encountered. As we will see shortly, the brackets from this list will be placed in an enclosing let expression along with those arising from the Heap.

Finally, we can implement splitHeap:

splitHeap :: Heap → ([(Var , Bracket)], Bracket) → Bracket
splitHeap h (xbrs, br) = plusBrackets (map inline (br : brs))
                                      (λ(e : es) → letRec (xs ‘zip‘ es) e)
  where (xs, brs) = unzip (xbrs ++ [(x, termBracket e) | (x, e) ← toList h ])

This completes the implementation of split. A real implementation will need to add several complications:

• The splitHeap function should attempt to push some elements of the Heap into the holes of the brackets from splitStack. A linearity analysis will be required in order to avoid duplicating work when non-value heap bindings get pushed down.
• The Heap should be let-floated to expose values under lets, and hence allow more bindings to be propagated downwards.
• In the presence of recursive let it is not always valid for splitStack to push down the entire stack into the branches of a residual case. This issue is discussed in more detail in Section 4.

3.6 Termination of the supercompiler

Although we have been careful to ensure that our evaluation function, reduce, is total, it is not so obvious that sc itself is terminating. Since split may recursively invoke sc via its higher order argument, we might get an infinitely deep stack of calls to sc! To rule out this possibility, sc carries a history, which – as we saw in Section 3 – is checked before any reduction is performed. If terminate allows the history to be extended, the input State is reduced before recursing. Otherwise, the input State is fed to split unchanged.

In order to be able to prove that the supercompiler terminates, we need some condition on exactly what sort of subcomponents split opt invokes opt on. It turns out that the presence of recursive let requires us to choose a rather complicated condition here, as we will explain further in Section 4.4. Let us pretend for a moment that we have no recursive let. In this scenario, it is always the case for our split that split opt s invokes opt s′ only if s′ ≺ s. The ≺ relation is a well-founded relation defined by s′ ≺ s ⟺ size(s′) < size(s), where size : State → ℕ returns the number of abstract syntax tree nodes in the State.
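The size measure itself can be made concrete with a tiny sketch; the exact counting conventions below (one node per heap binding and stack frame, plus an assumed sizeTerm on Terms) are our choice rather than the paper's:

  size :: State -> Int
  size (h, e, k) = sum [1 + sizeTerm e' | e' <- Map.elems h]  -- heap bindings
                 + sizeTerm e                                 -- the focus
                 + length k                                   -- stack frames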
This is sufficient to ensure termination, as the following argument shows:

Theorem: sc always recurses a finite number of times. Proceed by contradiction. If sc recursed an infinite number of times, then by definition the call stack would contain infinitely many activations of sc hist s for (possibly repeating) sequences of hist and s values. Denote the infinite chains formed by those values as ⟨hist0, hist1, ...⟩ and ⟨s0, s1, ...⟩ respectively.

Now, observe that there must be infinitely many i such that isContinue (terminate histi si). This follows because the only other possibility is that there must exist some j such that ∀l. l ≥ j ⟹ isStop (terminate histl sl). On such a suffix, sc is recursing through split without any intervening uses of reduce. However, by the property we required split to have, such a sequence of states must have a strictly decreasing size:

∀l. l > j ⟹ size(sl) < size(sj)

However, < is a well-founded relation, so such a chain cannot be infinite. This contradicts our assumption that this suffix of sc calls is infinite, so it must be the case that there are infinitely many i such that isContinue (terminate histi si).

Now, form the infinite chain ⟨t1, t2, ...⟩ consisting of the si such that isContinue (terminate histi si). By the properties of terminate, it follows that:

∀i j. j < i ⟹ ¬(tagBag tj ◁tb tagBag ti)

However, this contradicts the fact that ◁ is a well-quasi-order. Combined with the requirement that split opt only calls opt finitely many times, the whole supercompilation process must terminate.

Two non-termination checks. It is important to note that the history carried by sc is extended entirely independently from the history produced by the reduce function (similar to "transient reductions" [14]). The two histories deal with different sources of non-termination. The history carried by reduce prevents non-termination due to divergent expressions, such as this one:

let f x = 1 + (f x) in f 10

In contrast, the history carried by sc prevents non-termination that can arise from repeatedly invoking the split function – even if every subexpression would, considered in isolation, terminate. This is illustrated in the following program:
let count n = n : count (n + 1) in count 0

Left unchecked, we would repeatedly reduce the calls to count, yielding a value (a cons-cell) each time. The split function would then pick out both the head and tail of the cons cell to be recursively supercompiled, leading to yet another unfolding of count, and so on. The resulting (infinite) residual program would look something like:

let h0 = h1 : h2; h1 = 0
    h2 = h3 : h4; h3 = 1
    h4 = h5 : h6; h5 = 2
    ...

The check with terminate before reduction ensures that instead, one of the applications of count is left unreduced. This use of terminate ensures that our program remains finite:

let h0 = h1 : h2; h1 = 0
    h2 = let count = λn. h3 n in count 1
    h3 n = n : h3 (n + 1)
in h0

Negative recursion in data constructors. As a nice aside, the rigorous termination criterion gives us a stronger termination guarantee than the Glasgow Haskell Compiler (GHC) [15], the leading Haskell implementation. Because GHC does not check for recursion through negative positions in data constructors, the following notorious program will force GHC into an infinite loop:

data U = MkU (U → Bool )
russel u@(MkU p) = not (p u)
x = russel (MkU russel ) :: Bool

3.7 Observations on the supercompiler

It is a unique feature of our supercompiler that all our ingredients operate on States, rather than Terms. This is a consequence of explicitly basing the supercompiler on an evaluator, but it pays off in the splitter (Section 3.5) as well. The splitter operates distinctively differently on each of the three components of the State, and takes advantage of the explicit representation of the Stack to push the continuation into the branches of a residual case expression. To split a Term well would be much more inconvenient.

4. Extending to recursive let

In the previous section, we described all the pieces necessary to implement a complete supercompiler. The handling of recursive let is mostly straightforward in this framework, with the exception of two things:

• Update frames originating from recursive let complicate the splitter: Section 4.3
• The termination proof for the supercompiler becomes more complicated: Section 4.4

We cover each of these points in order.

4.1 Update frames

The evaluator (Figure 3 and Section 3.3) deals with a call-by-need language, using update frames in the conventional way to model laziness [10]. When a heap binding x ↦ e is demanded by a variable x coming into the focus of the evaluator, e may not yet be a value. To ensure that we only reduce any given heap-bound e to a value at most once, the evaluator pushes an update frame update x on the stack, before beginning the evaluation of e. After e has been reduced to a value, v, the update frame will be popped from the stack, which is the cue for the evaluator to update the heap with a binding x ↦ v, replacing the old one. Now, subsequent uses of x in the course of evaluation will be able to reuse that value directly, without reducing e again. As an example of how update frames work, consider this reduction sequence:

⟨x ↦ 1 + 2 | x + x | ε⟩
⟨x ↦ 1 + 2 | x | • + x⟩
⟨ | 1 + 2 | update x, • + x⟩
...
⟨ | 3 | update x, • + x⟩
⟨x ↦ 3 | 3 | • + x⟩
⟨x ↦ 3 | x | 3 + •⟩
⟨ | 3 | update x, 3 + •⟩
⟨x ↦ 3 | 3 | 3 + •⟩
⟨x ↦ 3 | 6 | ε⟩

Because the corresponding heap binding is removed from the heap whenever an update frame is pushed, the update frame mechanism is what causes reduction to become blocked if you evaluate a term which forms a black hole:

⟨x ↦ x + 1 | x | ε⟩
...
⟨ | x | • + 1, update x⟩ ↛

Update frames complicate the supercompiler slightly, but in a localised way – we must think carefully as to how the split function should deal with update frames.

4.2 Splitting in the presence of update frames

Just like all other kinds of stack frame, we want to push update frames into residual case branches. Consider this input to split:

⟨ | x | case • of T → F, update y, case • of F → (2, y)⟩

We will split as follows, pushing the whole stack, including the update frame for y, into the case branch:

case x of T → ⟨F | update y, case • of F → (2, y)⟩

After supercompilation is complete, we will then obtain an output term something like the following:

case x of T → let y = F in (2, y)

This is what the splitStack function we saw in Section 3.5 does.

4.3 Splitting update frames from recursive lets

The key problem that the splitter must face is that update frames derived from recursive let can interact badly with our intention to push the entire enclosing stack into the branches of a case. Consider this input to split:

⟨ | unk | • + y, case • of 1 → 2, update y, • + 2⟩

Following our earlier discussion of case, we might be tempted to split as follows:

case unk + y of 1 → ⟨2 | update y, • + 2⟩

However, this is a disastrous choice – due to the occurrence of y in the scrutinee, y is now a free variable of the output expression! The lesson here is that update frames should not be pushed inside case branches if they bind a variable that we may need to refer to outside the case. Following this rule, our example is instead split as follows:

let y = case unk + y of 1 → ⟨2⟩ in y + ⟨2⟩

Irritatingly, the choice about which update frames should not be pushed inside case branches is not as straightforward as a simple free-variable check. The reason is that choosing to not push an update frame down may make more of the variables bound by other pushable update frames free, and hence require us to prevent pushing in yet more update frames! Here is a contrived example illustrating the point – note that for clarity we will not write the update frames directly, and represent the States as if they were terms:
let w = fact z; y = unk + x
    x = case y of 10 → w + 3
    z = case x of 20 → a + 3
in z + w + a

Our initial guess at the output of split may be as follows:

let y = unk + ⟨x⟩
in case y of 10 → ⟨ let w = fact z; x = w + 3
                        z = case x of 20 → a + 3
                    in z + w + a ⟩

Unfortunately, x is now a free variable of the whole expression, and consequently we should not have pushed the update frame for x within the case branch. Based on this information, our next guess may be:

let w = ⟨fact z⟩; y = unk + ⟨x⟩
    x = case y of 10 → ⟨w + 3⟩
in case x of 20 → ⟨ let z = a + 3 in z + w + a ⟩

Note that we have now been forced not to push the w binding down into either case branch, because doing so would risk work duplication. Unfortunately, that has caused z to be free in the output expression! The correct solution is in fact to not push down the update frames for both x and z:

let w = ⟨fact z⟩; y = unk + ⟨x⟩
    x = case y of 10 → ⟨w + 3⟩
    z = case x of 20 → ⟨a + 3⟩
in z + ⟨w⟩ + ⟨a⟩

Our real split implementation uses a fixed point that follows essentially this reasoning process to determine the set of update frames which may not be pushed down.

4.4 Termination in the presence of recursive let

In Section 3.6 we showed why the supercompiler without recursive let terminated. However, to make that argument we had to rely on a condition on split that is simply too restrictive for the supercompiler with recursive let. Before, we used the property that split opt s invoked opt s′ only if s′ ≺ s ⟺ size(s′) < size(s). However, consider this input to split:

⟨f ↦ λy. Just (f (not y)) | Just (f (not y))⟩

We would like to split as follows:

let f = λx. ⟨f ↦ λy. Just (f (not y)) | Just (f (not y))⟩
in Just (f (not y))

This is disallowed by the size-based criterion because the recursively-optimised State would be no smaller than the input. In the presence of recursive let, we can instead use the property that for our split, split opt (h, e, k) only invokes opt on states (h′, e′, k′) that satisfy all of these conditions:

1. h′ ⊆ h ∪ alt-heap(e, k)
2. k′ ‘isInfixOf‘ k
3. e′ ∈ subterms(h, e, k)

The subterms(h, e, k) function returns all expressions that occur syntactically within any of the Heap, Stack or Term inputs. The alt-heap(e, k) function takes the variables bound by update frames in k and, if e ≡ Var x, the variable x. It then forms the cross product of that set with the values corresponding to the α in any case • of α → e ∈ k.
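Of these functions, subterms at least is easy to picture. A sketch, reusing the uniplate combinator of Section 3.5 together with an assumed stackTerms helper that extracts the terms embedded in stack frames:

  subterms :: State -> [Term]
  subterms (h, e, k) = concatMap go (Map.elems h ++ [e] ++ stackTerms k)
    where
      -- every term is a subterm of itself, plus the subterms of its children
      go t = t : concatMap go (fst (uniplate t))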
We are now in a position to repair the proof.

Theorem: sc always recurses a finite number of times. Proceed by contradiction. If sc recursed an infinite number of times, then by definition the call stack would contain infinitely many activations of sc hist s for (possibly repeating) sequences of hist and s values. Denote the infinite chains formed by those values as ⟨hist0, hist1, ...⟩ and ⟨s0, s1, ...⟩ respectively.

Now, observe that there must be infinitely many i such that isContinue (terminate histi si). This follows because the only other possibility is that there must exist some j such that ∀l. l ≥ j ⟹ isStop (terminate histl sl). On such a suffix, sc is recursing through split without any intervening uses of reduce. By the modified property of split and the properties of alt-heap and subterms we have that:

∀l. l ≥ j ⟹ hl ⊆ hj ∪ alt-heap(ej, kj) ∧ kl ‘isInfixOf‘ kj ∧ el ∈ subterms(sj)

We can therefore conclude that the infinite suffix must repeat itself at some point: ∃l. l > j ∧ sl ≡ sj. However, we required that match always succeeds when matching two terms equivalent up to renaming, which means that sc histl sl would have been tied back by memo rather than recursing. This contradicts our assumption that this suffix of sc calls is infinite, so it must be the case that there are infinitely many i such that isContinue (terminate histi si).

Now, form the infinite chain ⟨t1, t2, ...⟩ consisting of the si such that isContinue (terminate histi si). As in Section 3.6, this contradicts the fact that ◁ is a well-quasi-order.

Although the termination argument becomes more complex, the actual supercompilation algorithm remains as simple as ever.

5. Results

We have implemented the supercompiler for a subset of Haskell. It is implemented as a preprocessor: programs are run through the supercompiler before being compiled by GHC at the -O2 optimisation level. The preliminary results of running the supercompiler on a standard array of benchmark programs are shown in Figure 4. For comparison, we include benchmark results from a supercompiler of Mitchell [7].

The "append", "factorial", "raytracer", "sumtree" and "treeflip" benchmarks are all standard examples that have been described in previous work on supercompilation and deforestation [3, 4, 7, 16]. The "sumsquare" program is taken from work in stream fusion [17]. The "bernouilli", "digits-of-e2", "exp3_8", "primes", "rfib", "tak", "wheel-sieve1", "wheel-sieve2" and "x2n1" benchmarks are from the imaginary portion of the nofib benchmark suite [18].

We tested two variants of our supercompiler: one where the supercompiler evaluated primitive operations (primops), and one where it did not. Both variants treated primitives as strict operations.

The benchmark results are promising. The supercompiler without primops reduced runtime by an (arithmetic) average of 26% compared to GHC alone. Evaluating primops reduced the average runtime reduction to 16%. Similar to our system, Mitchell's system achieved an average reduction of 27%, though the improvements had a rather different profile.

The use of supercompilation in practice is limited because, despite the fact that it is guaranteed to terminate, it might take very long indeed to do so. Nofib imaginary suite benchmarks such as "digits-of-e1" and "gen_regexps" are prohibitively expensive to supercompile in both our system and that of Mitchell. Interestingly, the same problem afflicts "tak" – but only when evaluation of primops is enabled.
Mitchell [7]:
Program        SC(a)   Cmp(b)  Run(c)  Mem(d)    Size(e)
append         0.0s    0.88    0.86    0.85      1.29
bernouilli     5.8s    1.63    0.98    0.97      3.76
digits-of-e2   4.2s    1.24    0.32    0.46      1.15
exp3_8         0.8s    1.34    0.96    1.00      6.59
factorial      0.0s    0.99    0.95    1.00      0.77
primes         0.1s    1.04    0.63    0.99      0.79
raytracer      0.0s    1.00    0.57    0.44      1.54
rfib           0.0s    0.94    0.93    1.00      0.87
sumsquare      19.5s   1.45    0.36    0.00      7.38
sumtree        0.1s    1.01    0.13    0.00      1.50
tak            0.1s    0.86    0.81    655.04    0.59
treeflip       0.1s    1.03    0.56    0.45      1.99
wheel-sieve1   N/A     N/A     N/A     N/A       N/A
wheel-sieve2   N/A     N/A     N/A     N/A       N/A
x2n1           0.1s    1.06    0.92    0.99      1.39
Average        2.4s    1.10    0.73    44.35     2.11
Minimum        0.0s    0.86    0.13    0.00      0.59
Maximum        19.5s   1.63    1.00    655.04    7.38

Evaluator-based, no primops:
Program        SC(a)   Cmp(b)  Run(c)  Mem(d)      Size(e)
append         0.0s    1.00    0.89    0.87        3.24
bernouilli     0.1s    1.07    0.98    0.95        2.26
digits-of-e2   0.1s    1.07    1.17    1.08        2.81
exp3_8         8.7s    2.85    0.59    0.67        85.17
factorial      0.0s    0.96    0.99    1.00        1.00
primes         0.0s    0.98    0.72    1.07        0.87
raytracer      0.0s    1.00    0.52    0.45        1.37
rfib           0.0s    1.00    0.67    1.00        2.00
sumsquare      2.3s    1.97    0.05    0.00        20.78
sumtree        0.0s    1.02    0.14    0.00        2.46
tak            0.1s    1.34    0.74    18644.34    7.22
treeflip       0.0s    1.02    0.13    0.05        2.53
wheel-sieve1   22.2s   7.87    0.90    0.53        71.07
wheel-sieve2   1.3s    3.16    1.55    1.21        18.35
x2n1           0.0s    1.10    0.99    0.95        1.21
Average        2.3s    1.83    0.74    1243.61     14.82
Minimum        0.0s    0.96    0.05    0.00        0.87
Maximum        22.2s   7.87    1.55    18644.34    85.17

Evaluator-based, primops:
Program        SC(a)   Cmp(b)  Run(c)  Mem(d)   Size(e)
append         0.0s    1.03    0.92    0.87     3.24
bernouilli     0.1s    1.07    0.98    0.95     2.24
digits-of-e2   0.1s    1.08    1.18    1.09     2.79
exp3_8         15.4s   3.35    0.55    0.67     114.31
factorial      0.0s    0.98    1.05    1.00     0.91
primes         0.0s    0.98    0.71    1.07     0.80
raytracer      0.0s    1.00    0.51    0.45     1.38
rfib           0.0s    1.00    0.67    1.01     2.00
sumsquare      3.0s    1.95    0.06    0.00     21.15
sumtree        0.2s    1.24    0.68    0.93     9.09
tak            N/A     N/A     N/A     N/A      N/A
treeflip       0.2s    1.47    0.81    0.91     19.40
wheel-sieve1   16.8s   10.61   1.00    0.54     71.47
wheel-sieve2   1.4s    3.06    1.55    1.21     18.24
x2n1           0.0s    1.15    0.99    0.95     1.18
Average        2.7s    2.06    0.84    0.84     17.95
Minimum        0.0s    0.98    0.06    0.00     0.80
Maximum        16.8s   10.61   1.55    1.21     114.31

(a) Supercompilation time (seconds)
(b) GHC compile time relative to no supercompilation
(c) Program runtime relative to no supercompilation
(d) Runtime allocation relative to no supercompilation
(e) Size (in syntax tree nodes) of program relative to no supercompilation

Figure 4: Benchmark results
Primitive operations. Indeed, the supercompiler performed worse overall when evaluating primops than when it left them unevaluated – particularly suffering on "sumtree" and "treeflip". These benchmarks have a common structure where a binary tree is generated and then consumed by a function pipeline, terminated by a simple sum of the tree nodes. The initial construction of the tree does not deforest cleanly, but the consuming function pipeline makes several intermediate copies of the tree which can be deforested to produce a function that produces the required sum directly. Both our system (without primops) and Mitchell's system are able to fuse these pipelines together.

The addition of primops to the system means that we create specialisations of the fused pipeline that include in their evaluation contexts frames such as 2 + •, where 2 is a partial sum of the tree. Every specialisation of the fused pipeline includes such a stack frame, and because the partial sum changes regularly those specialisations can never be reused. We end up building a lot of specialisations of the pipeline for a few values of the partial sum, before the termination condition kicks in and stops us. Unfortunately, the resulting termination splitting prevents us from fusing the pipeline entirely. The net result is that the first few iterations of the sum are computed with perfect deforestation, but later iterations must fall back on a fully-forested function isomorphic to the original unfused pipeline.

Recursive let. We are able to report results for two benchmarks ("wheel-sieve1" and "wheel-sieve2") that Mitchell's system is unable to supercompile because they make fundamental use of recursive let. We achieve an improvement in "wheel-sieve1" by deforesting intermediate lists, but actually manage to increase allocations in "wheel-sieve2".

Opportunities for improvement. The "tak" benchmark reported a staggering 18,000-fold increase in allocations, although this was up from a very low base – the unmodified program allocates only 13kB. Mitchell's supercompiler exhibits the same problem, albeit to a lesser degree. Investigation shows that the allocation increase is due to supercompilation introducing several large join points which take boxed integers as arguments. When compiled without supercompilation, there are no join points and all arithmetic is unboxed by GHC's strictness analyser [19].

The benchmark where we do noticeably worse than Mitchell is "digits-of-e2" – we actually increase both allocations and runtime, while he reduces each figure by more than 50%. Although the exact reasons remain unclear, it appears that once again the problem is that the supercompilation process has prevented GHC from aggressively unboxing the output.

Supercompilation time. Benchmarking our supercompiler on one program ("digits-of-e2") showed that the largest single share of its time (42%) is spent on managing names and renaming. Matching against previous states accounted for 14% of the runtime. Only 6% of time was spent testing the termination condition.

6. Related Work

Supercompilation was introduced by Turchin [1], but has recently seen a revival of interest from the call-by-value [3], call-by-name [8] and call-by-need [7] perspectives. Partial evaluation [20] is a technique closely related to supercompilation. The fields overlap somewhat, but supercompilers tend to make a distinctive set of choices which set them apart: they specialise expressions in the context in which they occur, operate on unannotated programs and test for termination online. Theoretical work has suggested that certain kinds of partial evaluator suffer from strictly less information propagation than supercompilers, limiting their optimising power [6].

The idea of building a partial evaluation system around an actual evaluator is hardly new – it is present from the very earliest work by Sestoft et al. [21]. However, this approach seems to have received surprisingly little attention in the supercompilation community, though it is somewhat foreshadowed by early work of Turchin [22].

Much of the supercompilation literature makes use of the homeomorphic embedding test for ensuring termination [3, 8, 23]. Users of this test uniformly report that testing the termination condition makes up the majority of their supercompilers' runtime [3, 23]. The tag-bag criterion appears to be much more efficient – our supercompiler spends only 6% of its runtime on termination testing.

Jørgensen has previously produced a compiler for call-by-need through partial evaluation of a Scheme partial evaluator with respect to an interpreter for the lazy language [24]. His work made use of a partial evaluator capable of dealing with the set! primitive, which was used to implement updateable thunks. Our supercompiler takes a direct approach that avoids the need for any imperative features in the language being supercompiled.
7. Further Work

Because the supercompiler described here is nicely separated from issues of evaluation order, it should be straightforward to modify the system to supercompile a pure call-by-name language for e.g. the purposes of theorem proving. Indeed, a splitter for call-by-name (or call-by-value) is rather simple to define because such evaluation strategies have no equivalent to update frames, and it is always permissible to duplicate heap bindings – so no work-duplication check is required at all.

We plan to extend the supercompiler to work on the typed language System FC [25] for implementation as a part of GHC. Again, this should be fairly straightforward, and involve mostly local changes to the evaluator. Supercompilation works best when it has access to the whole program, but GHC already has the necessary facilities to get hold of the definitions from imported modules, in the shape of interface files.

Although our presentation is nicely modular, the split function remains a tricky point and heavily dependent on the semantics of the language under consideration. A principled way to derive split from the operational semantics would be an interesting avenue for further exploration.

8. Conclusions

Supercompilation is a simple, powerful and principled technique for program optimisation. A single pass with a supercompiler achieves many optimisations that have traditionally been laboriously specified and implemented independently. We have shown how to produce a supercompiler by basing it explicitly on an evaluator. This clean design allowed us to extend the technique to lazy languages with recursive let, by building the supercompiler around a call-by-need evaluator. Initial benchmark results are promising, but also bring to light weaknesses in the algorithm. In particular, a method is sorely needed for reducing the worst-case runtime of supercompilation.

Acknowledgments

This work was partly supported by a PhD studentship generously provided by Microsoft Research. Thanks are due to Neil Mitchell and Peter Jonsson for enlightening discussions and feedback, and to the anonymous reviewers for their detailed feedback.

References

[1] Valentin F. Turchin. The concept of a supercompiler. ACM Trans. Program. Lang. Syst., 8(3):292–325, 1986.
[2] Alexei P. Lisitsa and Andrei P. Nemytykh. Verification as specialization of interpreters with respect to data. In Proceedings of the First International Workshop on Metacomputation in Russia, pages 94–112, 2008.
[3] Peter A. Jonsson and Johan Nordlander. Positive supercompilation for a higher order call-by-value language. In POPL '09: Proceedings of the 36th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2009.
[4] Philip Wadler. Deforestation: Transforming programs to eliminate trees. In ESOP '88, volume 300 of Lecture Notes in Computer Science, pages 344–358. Springer Berlin / Heidelberg, 1988.
[5] Simon Peyton Jones. Constructor specialisation for Haskell programs. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, ICFP 2007, pages 327–337, 2007.
[6] Morten Heine Sørensen, Robert Glück, and Neil D. Jones. Towards unifying partial evaluation, deforestation, supercompilation, and GPC. In ESOP '94: Proceedings of the 5th European Symposium on Programming, pages 485–500, London, UK, 1994. Springer-Verlag.
[7] Neil Mitchell. Rethinking supercompilation. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, ICFP 2010. ACM, 2010.
[8] Ilya Klyuchnikov. Supercompiler HOSC 1.0: under the hood. Preprint 63, Keldysh Institute of Applied Mathematics, Moscow, 2009.
[9] Cormac Flanagan, Amr Sabry, Bruce F. Duba, and Matthias Felleisen. The essence of compiling with continuations. ACM SIGPLAN Notices, 28(6):237–247, 1993.
[10] Peter Sestoft. Deriving a lazy abstract machine. Journal of Functional Programming, 7(3):231–264, 1997.
[11] G. Higman. Ordering by divisibility in abstract algebras. Proceedings of the London Mathematical Society, 3(1):326, 1952.
[12] Michael Leuschel. On the power of homeomorphic embedding for online termination. In Static Analysis, volume 1503 of Lecture Notes in Computer Science, pages 230–245. Springer Berlin / Heidelberg, 1998.
[13] Neil Mitchell and Colin Runciman. Uniform boilerplate and list processing. In Proceedings of the ACM SIGPLAN Workshop on Haskell, page 60. ACM, 2007.
[14] Morten Heine Sørensen and Robert Glück. Introduction to supercompilation. In Partial Evaluation, volume 1706 of Lecture Notes in Computer Science, pages 246–270. Springer Berlin / Heidelberg, 1999.
[15] Simon Peyton Jones, Cordy Hall, Kevin Hammond, Will Partain, and Phil Wadler. The Glasgow Haskell compiler: a technical overview, 1992.
[16] Jan Kort. Deforestation of a raytracer. Master's thesis, Department of Computer Science, University of Amsterdam, The Netherlands, 1996.
[17] Duncan Coutts, Roman Leshchinskiy, and Donald Stewart. Stream fusion: From lists to streams to nothing at all. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, ICFP 2007, April 2007.
[18] Will Partain. The nofib benchmark suite of Haskell programs. In Proceedings of the 1992 Glasgow Workshop on Functional Programming, pages 195–202, London, UK, 1993. Springer-Verlag.
[19] Simon Peyton Jones and John Launchbury. Unboxed values as first class citizens in a non-strict functional language. In Functional Programming Languages and Computer Architecture, pages 636–666. Springer, 1991.
[20] Neil D. Jones, Carsten K. Gomard, and Peter Sestoft. Partial Evaluation and Automatic Program Generation. Prentice-Hall International Series in Computer Science, 1993.
[21] Peter Sestoft. The structure of a self-applicable partial evaluator. In Programs as Data Objects, volume 217 of Lecture Notes in Computer Science, pages 236–256. Springer Berlin / Heidelberg, 1986.
[22] Valentin F. Turchin. The algorithm of generalization in the supercompiler. In Dines Bjørner, Andrei P. Ershov, and Neil D. Jones, editors, Partial Evaluation and Mixed Computation, pages 531–549.
[23] Neil Mitchell and Colin Runciman. A supercompiler for core Haskell. In Implementation and Application of Functional Languages, volume 5083 of Lecture Notes in Computer Science, pages 147–164. Springer Berlin / Heidelberg, 2008.
[24] Jesper Jørgensen. Generating a compiler for a lazy language by partial evaluation. In POPL '92: Proceedings of the 19th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 258–268, New York, NY, USA, 1992. ACM.
[25] Martin Sulzmann, Manuel Chakravarty, Simon Peyton Jones, and Kevin Donnelly. System F with type equality coercions. In ACM SIGPLAN International Workshop on Types in Language Design and Implementation (TLDI '07). ACM, 2007.
Species and Functors and Types, Oh My!

Brent A. Yorgey
University of Pennsylvania
[email protected]

Abstract

The theory of combinatorial species, although invented as a purely mathematical formalism to unify much of combinatorics, can also serve as a powerful and expressive language for talking about data types. With potential applications to automatic test generation, generic programming, and language design, the theory deserves to be much better known in the functional programming community. This paper aims to teach the basic theory of combinatorial species using motivation and examples from the world of functional programming. It also introduces the species library, available on Hackage, which is used to illustrate the concepts introduced and can serve as a platform for continued study and research.

Categories and Subject Descriptors D.3.3 [Language Constructs and Features]: Data types and structures; D.1.1 [Programming Techniques]: Applicative (Functional) Programming; G.2.1 [Combinatorics]

General Terms Languages, Theory

Keywords Combinatorial species, algebraic data types

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Haskell'10, September 30, 2010, Baltimore, Maryland, USA. Copyright © 2010 ACM 978-1-4503-0252-4/10/09...$10.00

1. Introduction

The theory of combinatorial species was invented by André Joyal in 1981 [16] as an elegant framework for understanding and unifying much of enumerative combinatorics. Since then, mathematicians have continued to develop the theory, proving a wide range of fundamental results and producing at least one excellent reference text on the topic [4]. Connections to computer science and functional programming have been pointed out in detail, notably by Flajolet, Salvy, and Zimmermann [12, 13]. Sadly, however, this beautiful theory is not widely known among functional programmers.

Suppose Dorothy G. Programmer has created the following data type to aid in her ethological study of alate simian family groups:

data Family a = Monkey Bool a
              | Group a [Family a ]

That is, a family (parameterized by names of type a) is either a single monkey with a boolean indicating whether it can fly, or an alpha male together with a group of families. While developing and testing her software, Dorothy might want to do things such as enumerate or count all the family structures of a certain size, or randomly generate family structures. There exist tools for accomplishing at least two of these tasks: QuickCheck [9] and SmallCheck [22] can be used to do random and exhaustive generation, respectively.
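For instance, a hand-written QuickCheck generator for Family might look like the sketch below; the size-halving scheme is one plausible choice of ours, not something prescribed here:

  import Test.QuickCheck

  instance Arbitrary a => Arbitrary (Family a) where
    arbitrary = sized genFamily
      where
        genFamily 0 = Monkey <$> arbitrary <*> arbitrary
        genFamily n = oneof
          [ Monkey <$> arbitrary <*> arbitrary
          , Group  <$> arbitrary
                   <*> listOf (genFamily (n `div` 2))  -- shrink to keep trees finite
          ]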
The theory of combinatorial species, although invented as a purely mathematical formalism to unify much of combinatorics, can also serve as a powerful and expressive language for talking about data types. With potential applications to automatic test generation, generic programming, and language design, the theory deserves to be much better known in the functional programming community. This paper aims to teach the basic theory of combinatorial species using motivation and examples from the world of functional programming. It also introduces the species library, available on Hackage, which is used to illustrate the concepts introduced and can serve as a platform for continued study and research.
newtype Bag a = Bag [a ] and endow it with custom QuickCheck and SmallCheck generators, but this is rather ad-hoc. What if she later decides that the order of the families does matter after all, but only up to cyclic rotations? Or that groups must always contain at least two families? Or what if she wants to have a data structure representing the graph of interactions between different family groups? What Dorothy needs is a coherent framework in which to describe these sorts of sophisticated structures. The theory of species is precisely such a framework: for example, her original data structure can be described succinctly by the regular species expression F = 2 • X + X • (L ◦ F); Section 3 explains how to interpret this expression. The variants on her structure correspond to non-regular species (Section 5) and can be expressed with only simple tweaks to the original expression. The payoff is that these species expressions form an abstract syntax (Section 6) with multiple useful semantic interpretations, including ways to exhaustively enumerate, count, or randomly generate structures (Sections 7 and 8). This paper is available at http://www.cis.upenn.edu/ ~byorgey/papers/species-pearl.lhs as a literate Haskell document. The species library itself, together with a good deal of documentation and examples, is available on Hackage [10] at http://hackage.haskell.org/package/species.
Categories and Subject Descriptors D.3.3 [Language Constructs and Features]: Data types and structures; D.1.1 [Programming Techniques]: Applicative (Functional) Programming; G.2.1 [Combinatorics] General Terms Keywords
Languages, Theory
Combinatorial species, algebraic data types
1. Introduction The theory of combinatorial species was invented by Andr´e Joyal in 1981 [16] as an elegant framework for understanding and unifying much of enumerative combinatorics. Since then, mathematicians have continued to develop the theory, proving a wide range of fundamental results and producing at least one excellent reference text on the topic [4]. Connections to computer science and functional programming have been pointed out in detail, notably by Flajolet, Salvy, and Zimmermann [12, 13]. Sadly, however, this beautiful theory is not widely known among functional programmers. Suppose Dorothy G. Programmer has created the following data type to aid in her ethological study of alate simian family groups: data Family a = Monkey Bool a | Group a [Family a ]
2. Combinatorial species Intuitively, a species describes a family of structures, parameterized by a set of labels which identify locations in the structures. In programming language terms, a species is like a polymorphic type constructor with a single type argument.
That is, a family (parameterized by names of type a) is either a single monkey with a boolean indicating whether it can fly, or an alpha male together with a group of families. While developing and testing her software, Dorothy might want to do things such as enumerate or count all the family structures of
Definition 1. A species F consists of a pair of mappings, F• and F↔ , with the following properties: • F• , given a finite set U of labels, sends U to a finite set of
structures F• [U ] which can be “built from” the given labels. We call F• [U ] the set of F-structures with support U , or sometimes just F-structures over U . ∼ • F↔ , given a bijection σ : U1 ↔ U2 between two label sets U1 and U2 (i.e. a relabeling), “lifts” σ to a bijection between F-structures, ∼ F↔ [σ] : F• [U1 ] ↔ F• [U2 ].
Moreover, this lifting must be functorial: the identity relabeling should lift to the identity on structures, and composition of relabelings should commute with lifting.

We usually omit the subscript on F• when it is clear from context. For example, Figure 1 illustrates lifting a relabeling σ to a relabeling of binary trees.

Figure 1. Relabeling a binary tree

Note that the notion of structures in this definition is entirely abstract: a "structure" is just an element of the set output by F•, which could be anything (subject to the restrictions on F). Fans of category theory will recognize a much more concise version of this definition: a species is an endofunctor on B, the category of finite sets with bijections as arrows. You won't need to know any category theory to understand this paper or to make use of the species library; however, the categorical point of view does add considerable conciseness and depth to the study of combinatorial species. The ability to relabel structures means that the actual labels we use don't matter; we get "the same structures", up to relabeling, for any label sets of the same size. We might say that species are parametric in the label sets of a given size. In particular, F's action on all label sets of size n is determined by its action on any particular such set: if |U1| = |U2| and we know F[U1], we can determine F[U2] by lifting any bijection between U1 and U2. So we often take the finite set of natural numbers {1, ..., n} (also written [n]) as the canonical label set of size n, and write F[n] for the set of F-structures built from this set. As a final note on the definition, it is tempting to assume that labels play the role of the "data" held by data structures. Instead, however, labels should be thought of as names for the locations within a structure. The idea is that data structures can be decomposed into a shape together with some sort of content [1, 15]. In this case, a labeled shape is some sort of structure built out of labels, and the content can be specified by a mapping from labels to data (which need not be injective).

3. Regular species

Although the formal definition of species is good to keep in mind as a source of intuition, in practice we take an algebraic approach, building up complex species from a small set of primitive species and species operations. We start our tour of the species menagerie with what I term regular species.¹ These should seem like old friends: intuitively, regular species correspond to Haskell's algebraic data types. We'll step back to define regular species more abstractly in Section 3.2, but first let's see how to build them.

¹ There is no widely accepted name for this class of species; I call them regular since they correspond to the regular tree types of Morris et al. [20].

3.1 Basic regular species

For each primitive species or species operation, we will define a corresponding Haskell data type embodying the F• mapping—that is, values of the type will correspond to F-structures. The F↔ mapping, in turn, can be expressed by an instance of the Functor type class, whose method fmap :: (a → b) → f a → f b shows how to relabel a structure f a by applying a relabeling map a → b to each of its labels. (We can also use fmap to fill in a labeled shape with content, by applying it to a mapping from labels to data.) For each species we also exhibit a method to enumerate all distinct labeled structures on a given set of labels, via an instance of the Enumerable type class shown in Figure 2. The actual Enumerable type class used in the species library is more sophisticated, but not fundamentally so.

class Enumerable f where
  enumerate :: [a] → [f a]

Figure 2. The Enumerable type class

Finally, for each species or species operation we also exhibit a picture as an aid to intuition. These pictures are not meant to be formal, but they will generally conform to the following rules:

• The left-hand side of each picture shows a canonical set of labels (depicted as circled numbers), either of an arbitrary size, or of a size that is "interesting" for the species being defined. Although the canonical label set [n] is used, of course the labels could be replaced by any others.

• In the middle is an arrow labeled with the name of the species being defined.

• On the right-hand side is a set of structures, or some sort of schematic depiction of the construction of a "typical" structure (the species then being understood to construct such structures "in all possible ways"). When the name of a species is superimposed on a set of labels, it represents a structure of the given species built from the labels.

Zero  The species 0 (Figure 3) corresponds to the constantly void type constructor. That is, it yields no structures no matter what labels it is given as input. We are forced to cheat a bit in the Functor instance for 0, since Haskell does not allow empty case expressions.

data 0 a

instance Functor 0 where
  fmap = ⊥

instance Enumerable 0 where
  enumerate _ = []

Figure 3. The primitive species 0
One  The species 1 (Figure 4) yields a single unit structure when applied to an empty set of labels, and no structures otherwise. In other words, there is exactly one structure of type 1 a, and it contains no locations where values of type a can be stored. It corresponds to nullary constructors in algebraic data types. The unit structure built by 1 is shown in Figure 4 as a filled square, to emphasize the fact that it contains no labels.

data 1 a = 1

instance Functor 1 where
  fmap _ 1 = 1

instance Enumerable 1 where
  enumerate [] = [1]
  enumerate _  = []

Figure 4. The primitive species 1

The species of singletons  The species of singletons, X (Figure 5), yields a single structure when applied to a singleton label set, and no structures otherwise. That is, X corresponds to the identity type constructor, which has exactly one way of building a structure with a single location.

data X a = X a

instance Functor X where
  fmap f (X a) = X (f a)

instance Enumerable X where
  enumerate [x] = [X x]
  enumerate _   = []

Figure 5. The species X of singletons
Species sum  We define species sum (Figure 6) to correspond to type sum, i.e. disjoint (tagged) union. Given species F and G and a set of labels U, the set of (F + G)-structures over U is the disjoint union of the sets of F- and G-structures over U:

(F + G)[U] = F[U] ⊎ G[U].

In other words, an (F + G)-structure is either an F-structure or a G-structure, along with a tag specifying which. For example, 1 + X corresponds to the familiar Maybe type constructor. We can generalize 0 and 1 by defining the species n, for each n ≥ 0, to be the species which generates n distinct structures for the empty label set, and no structures for any nonempty label set; n is isomorphic to 1 + ··· + 1 (with n summands). For example, 2 corresponds to the constantly Bool type constructor, data CBool a = CBool Bool.

infixl 6 +
data (f + g) a = Inl (f a) | Inr (g a)

instance (Functor f, Functor g) ⇒ Functor (f + g) where
  fmap h (Inl x) = Inl (fmap h x)
  fmap h (Inr x) = Inr (fmap h x)

instance (Enumerable f, Enumerable g) ⇒ Enumerable (f + g) where
  enumerate ls = map Inl (enumerate ls)
              ++ map Inr (enumerate ls)

(+) :: (f a → g a) → (h a → j a) → (f + h) a → (g + j) a
(fg + hj) (Inl fa) = Inl (fg fa)
(fg + hj) (Inr ha) = Inr (hj ha)

Figure 6. Species sum
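To see the correspondence with Maybe concretely, we can enumerate (1 + X)-structures. The following is a sketch of the expected behavior, assuming the instances above (and suitable extensions for the symbolic type names, as in the paper's literate source); the printed form depends on Show instances not shown here:

> enumerate []  :: [(1 + X) Char]    -- one "Nothing-like" structure
[Inl 1]
> enumerate "a" :: [(1 + X) Char]    -- one "Just-like" structure
[Inr (X 'a')]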
It is not hard to verify that, up to isomorphism, 0 is the identity for species addition, and that + is associative and commutative. Since these algebraic laws correspond directly to generic isomorphisms between structures, we can represent the laws as Haskell code. We define a type of embedding-projection pairs, shown in Figure 7. A value of type f ↔ g is an isomorphism between f and g, witnessed by a pair of functions, one in each direction. We also define the identity isomorphism as well as composition and inversion of isomorphisms.

infix 1 ↔
data f ↔ g = (↔) { to :: ∀a. f a → g a, from :: ∀a. g a → f a }

ident :: f ↔ f
ident = id ↔ id

(>>>) :: (f ↔ g) → (g ↔ h) → (f ↔ h)
(fg ↔ gf) >>> (gh ↔ hg) = (gh ◦ fg) ↔ (gf ◦ hg)

inv :: (f ↔ g) → (g ↔ f)
inv (fg ↔ gf) = gf ↔ fg

Figure 7. Isomorphisms

Armed with these definitions, Figure 8 presents the algebraic laws for sum in Haskell form, as implemented in the species library. The one technical issue to note is that for the congruences inSumL and inSumR (and the corresponding inProdL and inProdR shown in the next section), we must be careful to use lazy pattern matches, since the isomorphism between f and g may not be needed. Always forcing the proof of (f ↔ g) to weak-head normal form can cause some isomorphisms between recursive structures (such as the one shown in Figure 15) to diverge.

sumIdL :: 0 + f ↔ f
sumIdL = (λ(Inr x) → x) ↔ Inr

sumComm :: f + g ↔ g + f
sumComm = swapSum ↔ swapSum
  where swapSum (Inl x) = Inr x
        swapSum (Inr x) = Inl x

sumAssoc :: f + (g + h) ↔ (f + g) + h
sumAssoc = reAssocL ↔ reAssocR
  where reAssocL (Inl x)       = Inl (Inl x)
        reAssocL (Inr (Inl x)) = Inl (Inr x)
        reAssocL (Inr (Inr x)) = Inr x
        reAssocR (Inl (Inl x)) = Inl x
        reAssocR (Inl (Inr x)) = Inr (Inl x)
        reAssocR (Inr x)       = Inr (Inr x)

inSumL :: (f ↔ g) → (f + h ↔ g + h)
inSumL ~(fg ↔ gf) = (fg + id) ↔ (gf + id)

inSumR :: (f ↔ g) → (h + f ↔ h + g)
inSumR ~(fg ↔ gf) = (id + fg) ↔ (id + gf)

Figure 8. Algebraic laws for sum
Species product  Just as species sum corresponds to type sum, we define species product (Figure 9) to correspond to type product, in such a way that the resulting structures contain each label exactly once. So, to form an (F • G)-structure over U, we split U into two disjoint subsets, and form an ordered pair of an F-structure built from the first subset and a G-structure built from the second. Doing this in all possible ways yields the species (F • G). Formally,

(F • G)[U] = ⨄_{U = U1 ⊎ U2} F[U1] × G[U2],

where ⨄ denotes repeated disjoint union and × denotes Cartesian product of sets.

infixl 7 •
data (f • g) a = f a • g a

instance (Functor f, Functor g) ⇒ Functor (f • g) where
  fmap h (x • y) = fmap h x • fmap h y

instance (Enumerable f, Enumerable g) ⇒ Enumerable (f • g) where
  enumerate ls = [x • y | (fls, gls) ← splits ls
                        , x ← enumerate fls
                        , y ← enumerate gls]

splits :: [a] → [([a], [a])]
splits []       = [([], [])]
splits (x : xs) = (map ◦ first) (x:) ss ++ (map ◦ second) (x:) ss
  where ss = splits xs
        first  f (x, y) = (f x, y)
        second f (x, y) = (x, f y)

Figure 9. Species product
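As a quick sanity check (again a sketch of the expected session, assuming the instances above and suitable Show instances), enumerating (X • X)-structures over two labels should produce exactly the two orderings:

> enumerate "ab" :: [(X • X) Char]
[X 'a' • X 'b', X 'b' • X 'a']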
For example, X • X (which can be abbreviated X²) is the species of ordered pairs. X yields no structures unless it is given a single label, so the only way to get an X • X structure is if we start with two labels and partition them into two singleton sets to pass on to the X's. Of course, there are two ways to do this, reflecting the two possible orderings of the labels. Similarly, X³ is the species of ordered triples, with 3! = 6 orderings for the labels, and so on. Up to isomorphism, 1 is an identity for species product, and 0 is an annihilator. It is also not hard to check that • is associative and commutative, and distributes over + (as usual, all up to isomorphism). Thus, species form a commutative semiring. The isomorphisms justifying these algebraic laws are shown in Figure 10, although their straightforward implementations are omitted in the interest of space.

prodIdL     :: 1 • f ↔ f
prodIdR     :: f • 1 ↔ f
prodAbsorbL :: 0 • f ↔ 0
prodComm    :: f • g ↔ g • f
prodAssoc   :: f • (g • h) ↔ (f • g) • h
prodDistrib :: f • (g + h) ↔ (f • g) + (f • h)
inProdL     :: (f ↔ g) → (f • h ↔ g • h)
inProdR     :: (f ↔ g) → (h • f ↔ h • g)

Figure 10. Algebraic laws for product

Least fixed points and the implicit species theorem  If we add a least fixed point operator μ, we now get the regular types or algebraic data types familiar to any functional programmer [20]. For example, the species L of linear orderings (or lists for short) can be defined as L = μL. 1 + X • L. That is, a list is either empty (1) or an element paired with a list (X • L). For any set U of labels, L[U] is the set of all linear orderings of U (Figure 11); of course, |L[n]| = n!.

instance Enumerable [] where
  enumerate = Data.List.permutations

listRec :: [] ↔ 1 + (X • [])
listRec = unroll ↔ roll
  where unroll []       = Inl 1
        unroll (x : xs) = Inr (X x • xs)
        roll (Inl 1)          = []
        roll (Inr (X x • xs)) = x : xs

Figure 11. The species L of linear orderings

Actually, mathematicians would not write L = μL. 1 + X • L, but simply L = 1 + X • L. This is not because they are being sloppy, but because of the implicit species theorem [4], which is a combinatorial analogue of the implicit function theorem from analysis. Suppose we have a species equation which implicitly defines F in terms of itself. If F yields no structures on the empty label set, and is not trivially reducible to itself, then the implicit species theorem guarantees that there is a unique solution for F with F(0) = 0, which is exactly the least fixed point of the implicit equation. Of course, the criteria given above are somewhat vague; a more precise formulation is explained in Section 8.1. The species L, as defined above, does not actually meet these criteria, since it yields a structure on the empty label set. However, if we let L+ denote the species of nonempty lists, with L+ + 1 = L, then we have L+ = X • (L+ + 1), which does meet the criteria and hence has a unique solution. Thus, L = 1 + L+ is uniquely determined as well, and we are justified in forgetting about μ and simply manipulating the implicit equation L = 1 + X • L however we like. For example, expanding L in its own definition, we find that

L = 1 + X • L
  = 1 + X • (1 + X • L)
  = 1 + X + X² • L

Continuing this process, we find that L = 1 + X + X² + X³ + ..., which corresponds to the observation that a list is either empty, or a single element, or an ordered pair, or an ordered triple, and so on. We can also "solve" the implicit equation for L to obtain L = 1/(1 − X). This may seem nonsensical at this point—we have defined neither subtraction nor division of species—but it is perfectly valid in the context of virtual species (Section 8.1), and directly corresponds to the exponential generating function for L (Section 7). As another example of recursive species and the power of the implicit species theorem, consider the Haskell data types shown in Figure 12.

data BTree a = Empty
             | Node a (BTree a) (BTree a)

data Paren a = Leaf a
             | Pair (Paren a) (Paren a)

Figure 12. Binary trees and binary parenthesizations

BTree is the type of binary trees with data at internal nodes, and Paren is the type of binary parenthesizations, that is, full binary trees with data stored in the leaves (Figure 13).

Figure 13. Binary trees (top) and parenthesizations (bottom)

It is not hard to write down equations implicitly defining species corresponding to these types:

B = 1 + X • B²
P = X + P²

Figure 14 shows the isomorphisms witnessing these implicit equations.

bTreeRec :: BTree ↔ 1 + X • BTree • BTree
bTreeRec = unroll ↔ roll
  where unroll Empty        = Inl 1
        unroll (Node x l r) = Inr (X x • l • r)
        roll (Inl 1)             = Empty
        roll (Inr (X x • l • r)) = Node x l r

parenRec :: Paren ↔ X + Paren • Paren
parenRec = unroll ↔ roll
  where unroll (Leaf x)   = Inl (X x)
        unroll (Pair l r) = Inr (l • r)
        roll (Inl (X x))   = Leaf x
        roll (Inr (l • r)) = Pair l r

Figure 14. Implicit equations for B and P

Again, B itself does not fulfill the conditions of the implicit species theorem, but we can easily express it in terms of one that does. Suppose we happen to notice that there seem to always be the same number of B-structures over n labels as there are P-structures over n + 1 labels (say, by enumerating small instances). Is there some sort of isomorphism lurking here? In particular, can we pair an extra element with a BTree to get something isomorphic to a Paren? Well, pairing an extra element with a BTree corresponds to the species X • B. Let's just do some algebra and see what we get:

X • B = X • (1 + X • B²)
      = X + X² • B²
      = X + (X • B)²

Thus, X • B satisfies the same implicit equation as P—so by the implicit species theorem, they must be isomorphic! Not only that, we can directly read off the isomorphism as the composition of the isomorphisms corresponding to our algebraic manipulations (Figure 15). Of course, coding this by hand is a bit of a pain, but it is not hard to imagine deriving it automatically: the species library cannot yet do this, but it is planned for a future release.

bToP :: X • BTree ↔ Paren
bToP = inProdR bTreeRec
   >>> prodDistrib
   >>> inSumL prodIdR
   >>> inSumR (    inProdR (inProdL prodComm >>> inv prodAssoc)
               >>> prodAssoc
               >>> inProdL bToP
               >>> inProdR bToP)
   >>> inv parenRec

Figure 15. The isomorphism between X • B and P

Could we have come up with this isomorphism without the theory of species? Of course. This particular isomorphism is not even that complex. The point is that by boiling things down to their essentials, the theory allowed us to elegantly and easily derive the isomorphism with only a few lines of algebra.
3.2 Regular species, formally

We are now ready to state the precise definition of regular species. A first characterization is as follows:

Definition 2. A species F is regular if it can be expressed in terms of 1, X, +, •, and least fixed point.

This definition validates the promised intuition that regular species correspond to Haskell algebraic data types, since normal Haskell 98 data type declarations provide exactly these features (forgetting for the moment about infinite data structures). However, there is a more direct characterization that makes apparent why this particular collection of species is interesting. We must first define what we mean by the symmetries of a structure. Recall that Sn denotes the symmetric group of order n, which has permutations of size n (that is, bijections between {1, ..., n} and itself) as elements, and composition of permutations as the group operation.

Definition 3. A permutation σ ∈ Sn is a symmetry of an F-structure f ∈ F[U] if and only if σ fixes f, that is, F↔[σ](f) = f.

For example, Figure 16 depicts a tree with a set of labels at each node. This structure has many nontrivial symmetries, such as the permutation which swaps 4 and 6 but leaves all the other labels unchanged; since 4 and 6 are in the same set, swapping them has no effect.

Figure 16. A labeled structure with nontrivial symmetries

However, the binary trees shown in Figure 1 have only the trivial symmetry, since permuting their labels in any nontrivial way yields different trees.

Definition 4. A species F is regular if every F-structure has the identity permutation as its only symmetry; such structures are also called regular.

It turns out that these two definitions are equivalent (with the slight caveat that we must allow countably infinite sums and products in the first definition). That species built from sum, product, and fixed point have no symmetries is not hard to see; less obvious is the fact that up to isomorphism, every species with no symmetries can be expressed in this way (a proof sketch is given in Section 8.1). Of course, since we cannot write down infinite sums or products in Haskell, there are some regular species which cannot be expressed as simple algebraic data types. For example, the regular species of prime-length lists,

X² + X³ + X⁵ + X⁷ + X¹¹ + ...,

cannot be written as a simple algebraic data type.² But aside from infinite sums and products, as long as we stick to data structures with no symmetries, Haskell's data types are perfectly adequate to express any data structure we could possibly think up.

² Although I am sure it can be expressed using GADTs and type-level arithmetic...

3.3 Other operations on regular species

In addition to sum and product, the class of regular species is also closed under other fundamental operations.

Species composition  Given species F and G, the composition F ◦ G is a species which builds "F-structures made out of G-structures", with the underlying labels distributed to the G-structures so that each label occurs exactly once in the overall structure (Figure 17). However, in order to ensure we get only a finite number of (F ◦ G)-structures of each size, G must not yield any structures on the empty label set. This corresponds exactly to the criterion for composing formal power series, namely, that the inner series have no constant term. Specifically, to build an (F ◦ G)-structure over a label set U, we

• partition U into some number of nonempty disjoint parts, U = U1 ⊎ U2 ⊎ ··· ⊎ Uk;
• create a G-structure on each of the Ui;
• create an F-structure on these G-structures.

Doing this in all possible ways yields the set of (F ◦ G)-structures over U.

newtype (f ◦ g) a = C { unC :: f (g a) }

instance (Functor f, Functor g) ⇒ Functor (f ◦ g) where
  fmap h = C ◦ (fmap ◦ fmap) h ◦ unC

instance (Enumerable f, Enumerable g) ⇒ Enumerable (f ◦ g) where
  enumerate ls = [C y | p ← partitions ls
                      , gs ← mapM enumerate p
                      , y ← enumerate gs]

partitions :: [a] → [[[a]]]
partitions []       = [[]]
partitions (x : xs) = [(x : ys) : p | (ys, zs) ← splits xs
                                    , p ← partitions zs]

Figure 17. Species composition

For example, R = X • (L ◦ R) (where L denotes the species of linear orderings) defines the species of rose trees, as defined in Data.Tree, with each node having a data element and any number of children; a standalone sketch of this recipe appears below. We can also easily encode nested data types [6] (such types are sometimes called "non-regular", although that nomenclature is confusing in the current context, since they do in fact correspond to regular species). For example, B = X + B ◦ X² is the species of complete binary trees; a B-structure is either a single leaf, or a complete binary tree with pairs of elements at the leaves. It is not hard to verify that composition is associative (but not commutative), and that it has X as both a left and right identity. Composition also distributes over both sum and product from the right: (F + G) ◦ H = (F ◦ H) + (G ◦ H), and similarly for (F • G) ◦ H. As noted at the beginning of this section, regular species are closed under composition. Although we won't prove this formally, it makes intuitive sense: if an F-structure has no symmetries, and in each location we put a G-structure which also has no symmetries, the resulting composed structure cannot have any symmetries either.
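To make the composition recipe concrete, here is a minimal standalone sketch in plain ASCII Haskell (the helper names are ours, not the library's) that enumerates labeled rose trees directly, following R = X • (L ◦ R): pick a root label, then build an ordered sequence of nonempty blocks from the remaining labels, recursively putting a rose tree on each block.

data Rose a = Rose a [Rose a] deriving Show

-- all ways to pick one label as the root, keeping the rest
picks :: [a] -> [(a, [a])]
picks []     = []
picks (x:xs) = (x, xs) : [ (y, x:ys) | (y, ys) <- picks xs ]

-- all (nonempty subset, complement) splits of a label set
blocks :: [a] -> [([a], [a])]
blocks = filter (not . null . fst) . sub
  where sub []     = [([], [])]
        sub (x:xs) = concat [ [(x:ys, zs), (ys, x:zs)] | (ys, zs) <- sub xs ]

-- an (L ◦ R)-structure: an ordered sequence of rose trees over disjoint blocks
forests :: [a] -> [[Rose a]]
forests [] = [[]]
forests ls = [ t : ts | (b, r) <- blocks ls, t <- roses b, ts <- forests r ]

-- an R = X • (L ◦ R) structure: a root label plus a forest over the rest
roses :: [a] -> [Rose a]
roses ls = [ Rose x ts | (x, rest) <- picks ls, ts <- forests rest ]

For example, roses [1,2] yields the two labeled trees Rose 1 [Rose 2 []] and Rose 2 [Rose 1 []], and roses [1,2,3] yields twelve trees, matching 3! · Catalan(2) = 12.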
Cardinality restriction  For a species F and a natural number n, Fn denotes the species F restricted to label sets of cardinality n. That is,

Fn[U] = F[U]  if |U| = n
Fn[U] = ∅     otherwise.

More generally, if P is any predicate on natural numbers, FP denotes the restriction of F to label sets whose size satisfies P. For example, Leven is the species of lists of even length. We have Leven = 1 + X² • Leven = 1 + X • Lodd and L = Leven + Lodd. More generally, for any species we have F = Feven + Fodd = F0 + F1 + F2 + ... As a final note, we often write F+ as an abbreviation for F>0, the species of nonempty F-structures. Unfortunately, it is difficult to represent general cardinality restriction with a Haskell type, since we would have to somehow embed a predicate on integers into the type level.

Differentiation  Of course, no discussion of an algebra for data types would be complete without mentioning differentiation. There has been a great deal of fascinating work in the functional programming community on differentiating data structures [2, 14, 17, 18]. As usual, however, the mathematicians beat us to it! Intuitively, the derivative of a data type D is the type of D-structures with a single "hole", that is, a distinguished location not containing any data. This is useful, for example, in building zipper structures [14] which keep track of a movable focus within a data structure. We can make this precise in the context of species by using a "dummy label" to correspond to the hole. Formally, given a species F, its derivative F′ sends label sets U to the set of F-structures on the label set U ∪ {∗}, where ∗ is a new element distinct from all elements of U. That is,

F′[U] = F[U ∪ {∗}].

newtype Diff f a = Diff (f (Maybe a))
  deriving Functor

instance Enumerable f ⇒ Enumerable (Diff f) where
  enumerate ls = map Diff (enumerate (Nothing : map Just ls))

Figure 18. Species differentiation

For example, L′-structures are lists with a distinguished hole. Since the structure on either side of the distinguished location is also a list, we have L′ = L². (The reader may enjoy proving this formally using the implicit species theorem and the algebraic identities for differentiation listed below.) Figure 18 shows a representation of this idea in Haskell. It is important to note that Diff f a will often admit values which are not among those listed by enumerate. For example, although as a Haskell type Diff 1 Int is perfectly well inhabited by the value Diff 1, enumerate will always produce the empty list at type Diff 1 Int. Isomorphisms between structure types should map between values listed by enumerate and not necessarily between all Haskell values of the given types; thus, for example, we are justified in treating Diff 1 as isomorphic to 0. Although regular species are closed under differentiation, it should be noted that there are regular species which are the derivative of a non-regular species (X, for example, is the derivative of the non-regular species E2, to be defined in Section 5). Why refer to this operation as differentiation? Simply because, somewhat astonishingly, it satisfies all the same algebraic laws as the differentiation operation in calculus!

• 1′ = 0
• X′ = 1
• (F + G)′ = F′ + G′
• (F • G)′ = F′ • G + F • G′
• (F ◦ G)′ = (F′ ◦ G) • G′

Explaining the intuition behind these isomorphisms and expressing them in Haskell is left as a challenge for the reader.
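As a small illustration of L′ = L² (a sketch; the helper below and its name are ours, not part of the paper's code), a one-hole list as produced by enumerate contains exactly one Nothing, so splitting at the hole yields the pair of lists predicted by L²:

import Data.Maybe (catMaybes, isNothing)

-- split a one-hole list at its hole; assumes exactly one Nothing,
-- which holds for values produced by enumerate
splitAtHole :: Diff [] a -> ([a], [a])
splitAtHole (Diff xs) = (catMaybes before, catMaybes (drop 1 after))
  where (before, after) = break isNothing xs

Both sides have (n + 1)! structures on n labels—for instance, 6 one-hole lists and 6 pairs of lists over two labels.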
4. Unlabeled structures

Before moving on to non-regular species, it's worth pausing to make precise what is meant by the "shape" of a structure.

Definition 5. For a species F, an unlabeled F-structure, or F-shape, is an equivalence class of labeled F-structures, where two structures s and t are considered equivalent if there is some relabeling σ such that F↔[σ](s) = t.

In other words, two labeled structures are equivalent if they are relabelings of each other. For example, Figure 19 shows three rose tree structures. The first two are equivalent, but the third is not equivalent to the first two, since there is obviously no way to turn it into the first two merely by changing the labels.

Figure 19. Equivalent and inequivalent trees

Although unlabeled structures formally consist of equivalence classes of labeled structures, we can informally think of them as normal structures built from "indistinguishable labels"; for a given species F, there will be one unlabeled F-structure for each possible "shape" that F-structures can take. For example, Figure 20 shows all the rose tree shapes on four nodes.

Figure 20. Unlabeled rose trees of size 4

For regular species, the distinction between labeled and unlabeled structures is uninformative. Since every possible permutation of the labels of a regular structure results in a distinct labeled structure, there are always exactly n! times as many labeled as unlabeled structures of size n. Thus, a method to enumerate all unlabeled structures of a regular species is easy. In fact, it is a slight simplification of the code we have already exhibited for enumerating labeled structures: instead of taking a list of labels as input, we take simply a natural number size, and output structures full of unit values. Enumerating unlabeled structures for non-regular species, however, is much more complicated; a partial implementation of unlabeled enumeration can be found in the species package. For example, we can enumerate all unlabeled sets of sets of size 4, corresponding to integer partitions of 4:

> enumerateU (set ‘o‘ nonEmpty set) 4 :: [Comp Set Set ()]
[{{(),(),(),()}},{{(),()},{(),()}},
 {{()},{(),(),()}},{{()},{()},{(),()}},
 {{()},{()},{()},{()}}]

5. Beyond regular species

As promised, the theory of combinatorial species can describe structures with nontrivial symmetries just as easily as regular structures. This section introduces common non-regular species and combinators.

The species of sets  The primitive species of sets, usually denoted E (from the French ensemble), represents unordered collections of elements. For any given set of labels, there is exactly one set structure, the set of labels itself. It is easy to see that E is not regular, since E-structures have every possible symmetry; permuting the elements of a set leaves the set unchanged. Although the standard mathematical name for this species is the species of sets, a better name for it from a computer science perspective is the species of bags. The term set usually indicates both that the order of the elements doesn't matter and that there are no duplicate elements; the species of bags only embodies the former, since we can have non-injective mappings from labels to data. However, to model sets we can certainly restrict ourselves to injective mappings.

newtype Bag a = Bag [a]
  deriving Functor

instance Eq a ⇒ Eq (Bag a) where
  Bag xs ≡ Bag ys = xs ‘subBag‘ ys ∧ ys ‘subBag‘ xs
    where subBag b = null ◦ foldl (flip delete) b

instance Enumerable Bag where
  enumerate ls = [Bag ls]

Figure 21. The species E of sets

In Figure 21, we declare a Haskell data type for E by declaring Bag to be isomorphic to [] via a newtype declaration, and deriving a Functor instance for Bag from the existing instance for lists, using GHC's newtype deriving extension. Since we don't care about the order of the elements in a Bag, we create an Eq instance to identify any two Bags with the same elements. We can also use E to build other interesting and useful species. For example:

• X • E is the species of pointed sets, also known as the species of elements and sometimes written ε. Pointed set structures consist of a single distinguished label paired with a set structure on the rest of the labels. Thus, there are precisely n labeled ε-structures on a set of n labels, one for each choice of the distinguished label. We can also think of an (X • E) structure as consisting solely of the distinguished label, since the set of remaining labels adds no information. In other words, we can treat E as a sort of "sink" for the elements we don't care about, and the same technique can be used generally for describing structures containing only a subset of the labels.

• E • E is the species of subsets, sometimes abbreviated ℘ in reference to the power set operator. Again, ℘-structures technically consist of a subset of the labels paired with its complement, but we may (and often do) ignore the complement (see the short example after this list).

• (E ◦ E+) is the species of set partitions: its structures are collections of nonempty sets.
154
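For instance, enumerating ℘ = E • E structures over two labels should produce the 2² = 4 subsets, each paired with its complement. This is a sketch of the expected session, assuming the instances above (and a Show instance for Bag, which is not defined in the paper):

> enumerate "ab" :: [(Bag • Bag) Char]
[Bag "ab" • Bag "", Bag "a" • Bag "b",
 Bag "b" • Bag "a", Bag "" • Bag "ab"]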
The species of cycles  The primitive species C of directed cycles (Figure 22) yields all directed cyclic orderings (known in the mathematical literature as necklaces) of the given labels. By convention, cycles are always nonempty, and are read clockwise when represented pictorially.

newtype Cycle a = Cycle [a]
  deriving Functor

instance Eq a ⇒ Eq (Cycle a) where
  Cycle xs ≡ Cycle ys = any (≡ ys) (rotations xs)
    where rotations xs = zipWith (++) (tails xs) (inits xs)

instance Enumerable Cycle where
  enumerate []       = []
  enumerate (x : xs) = (map (Cycle ◦ (x:)) ◦ permutations) xs

Figure 22. The species C of cycles

C is also non-regular, since each labeled C-structure is fixed by certain nontrivial permutations, namely, the ones which only "rotate" the labels. An example of an interesting species we can build using C is E ◦ C, the species of permutations, corresponding to the observation that every permutation can be decomposed into a collection of disjoint cycles. We note also that a cycle with a hole in it is isomorphic to a list, that is, C′ = L (Figure 23).

Figure 23. Differentiating a cycle

Cartesian product  Given two species F and G, we may define the Cartesian product F × G by

(F × G)[U] = F[U] × G[U],

where the × on the right denotes the standard Cartesian product of sets. That is, an (F × G)-structure is a pair of an F-structure and a G-structure, both of which are built over all the labels, instead of partitioning the labels as with normal product (Figure 24). However, instead of thinking of the labels as being duplicated, we think of an (F × G)-structure as two structures which are superimposed on the same label set. In particular, when specifying the content for an (F × G)-structure, we should still only map each label to a single piece of data.

data (f × g) a = f a × g a

instance (Functor f, Functor g) ⇒ Functor (f × g) where
  fmap f (x × y) = fmap f x × fmap f y

instance (Enumerable f, Enumerable g) ⇒ Enumerable (f × g) where
  enumerate ls = [x × y | x ← enumerate ls, y ← enumerate ls]

Figure 24. Cartesian product of species
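For example, superimposing two list structures on the same two labels should give 2! × 2! = 4 structures. A sketch of the expected session, assuming the instances above together with the Enumerable [] instance from Figure 11:

> enumerate "ab" :: [([] × []) Char]
["ab" × "ab", "ab" × "ba", "ba" × "ab", "ba" × "ba"]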
One interesting use of Cartesian product is to model some type class-polymorphic data structures, where the type class methods provide us with a second observable structure on the data elements. For example, a type constructor F with an Eq constraint on its argument can be modeled by the species

F × (E ◦ E+).

Structures of this species consist of an F-structure with a superimposed partition on the labels, with each part corresponding to an equivalence class. For example, Figure 25 shows a binary tree shape with a superimposed partition indicating which sets of elements are equal.

Figure 25. A (B × (E ◦ E+))-shape

Likewise, we can model an Ord constraint by superimposing an (L ◦ E+)-structure, which additionally places an observable ordering on the equivalence classes. This works particularly well in conjunction with the approach of Bernardy et al. [5] for testing polymorphic functions. Because of parametricity, it suffices to test polymorphic functions on randomly generated shapes filled with carefully chosen data; if the function works correctly for the chosen data then by parametricity it will work correctly for any data. The above discussion shows that we can treat Eq and Ord constraints as part of the shape of an input structure, and choose data to match. Cartesian product has E as both a left and right identity, and is associative, commutative, and distributes over species sum. Again, implementing these laws as isomorphisms is left as an exercise for the reader.

Functor composition  The final species operation we will explore is functor composition. Given species F and G, we define their functor composite by

(F □ G)[U] = F[G[U]],

that is, F-structures over the set of all G-structures on U (Figure 26). Like (F ◦ G)-structures, an (F □ G)-structure is an F-structure of G-structures, but instead of partitioning the labels U among the G-structures, we give all the labels to every G-structure. As with Cartesian product structures, (F □ G)-structures appear to contain each label multiple times, but in fact we should still think of them as containing each label once, with a rich structure superimposed on it.

data (f □ g) a = FC { unFC :: f (g a) }

instance (Functor f, Functor g) ⇒ Functor (f □ g) where
  fmap h = FC ◦ (fmap ◦ fmap) h ◦ unFC

instance (Enumerable f, Enumerable g) ⇒ Enumerable (f □ g) where
  enumerate = map FC ◦ enumerate ◦ enumerate

Figure 26. Functor composition

The functor composition operation is especially useful for defining species of graphs and relations. For example, recalling that ℘ = E • E is the species of subsets and E2 is the species of sets restricted to sets of size two, ℘ □ (E2 • E) defines the species of simple graphs. An (E2 • E)-structure is a set of two labels, which we can think of as an undirected edge, and a simple graph is a subset of the set of all possible edges. In fact, many graph-like species can be defined as ℘ □ G for a suitable species G. For example, G = X² • E gives directed graphs without self-loops, and G = ε × ε gives directed graphs with self-loops allowed (recalling that ε = X • E is the species of elements). The reader may enjoy discovering how to represent the species of undirected graphs with self-loops allowed.
6. An embedded language of species

We have defined a type corresponding to each primitive species and species operation, but we would also like to be able to write down and compute with species expressions at the term level. The perfect way to do this is with a type class defining a domain-specific language of species expressions. The expressions can then be interpreted in different ways (for example, as exponential generating functions, cycle index series, abstract syntax trees, or random generation routines) depending on the types at which they are instantiated. The basic Species type class, as defined in the species library, is shown in Figure 27. The actual Species class contains additional methods, but this is the core essence.

class (Differential.C s) ⇒ Species s where
  singleton :: s
  set       :: s
  cycle     :: s
  (◦)       :: s → s → s
  (×)       :: s → s → s
  (□)       :: s → s → s
  ofSize    :: s → (Z → Bool) → s

Figure 27. The Species type class

Some things may seem to be missing (0, 1, sum and product, differentiation) but these are actually provided by the Differential.C constraint (from the numeric-prelude package), which ensures that species are a differentiable ring. The remainder of the class requires primitive singleton, set, and cycle species; composition (◦), cartesian product (×), and functor composition (□) operations; and a cardinality-restricting operator ofSize. It is not hard to put together the Enumerable instances we have already seen into code which enumerates all the labeled structures of a given species. The user can then call the enumerate method on an expression of type Species s ⇒ s, along with some labels to use:

-- cycles of lists
> enumerate (cycle ‘o‘ (nonEmpty linOrd)) "abc" :: [Comp Cycle [] Char]
[<"cba">,<"cab">,<"bca">,<"bac">,<"acb">,
 <"abc">,<"a","cb">,<"a","bc">,<"ca","b">,
 <"ac","b">,<"ba","c">,<"ab","c">,
 <"b","a","c">,<"a","b","c">]

-- simple graphs on three vertices
> enumerate (subset @@ ksubset 2) [1,2,3] :: [Comp Set Set Int]
[{},{{1,2}},{{1,3}},{{1,3},{1,2}},{{2,3}},
 {{2,3},{1,2}},{{2,3},{1,3}},
 {{2,3},{1,3},{1,2}}]
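In the same style, the species of permutations from Section 5 can be enumerated directly. The following is only a sketch of the expected behavior, assuming that cycle needs no explicit nonEmpty wrapper (since cycles are nonempty by convention); E ◦ C-structures on three labels are exactly the permutations of those labels, so we expect 3! = 6 of them:

-- permutations as sets of disjoint cycles
> length (enumerate (set ‘o‘ cycle) [1,2,3] :: [Comp Set Cycle Int])
6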
Since the species library is able to automatically generate Species expressions representing any user-defined data type, we can also enumerate values of user-defined data types, such as Family Int:

> enumerate family [1,2] :: [Family Int]
[ Group 2 [Group 1 []], Group 2 [Monkey True 1]
, Group 2 [Monkey False 1]
, Group 1 [Group 2 []], Group 1 [Monkey True 2]
, Group 1 [Monkey False 2]]

We can also control how the enumeration happens by explicitly specifying the species to use for the Family type, rather than using the default.

7. Generating functions and counting

What else can we do with combinatorial species? A key element of the story we haven't seen yet is the correspondence between species and generating functions. Generating functions are an indispensable tool in combinatorics, and have a well-developed theory [23]. Much of their power lies in the surprising fact that many natural power series operations (addition, multiplication, substitution...) have natural combinatorial interpretations (disjoint union, independent choice, partition...). Every species can be associated with several different generating functions, each of which encodes certain aggregate information about the species. For example, we can associate to each species F an exponential generating function (egf) of the form

F(x) = ∑_{n≥0} fₙ xⁿ/n!,

where fₙ is the number of distinct labeled F-structures of size n. (Note that x is a purely formal parameter, and we need not concern ourselves with convergence; for a more detailed explanation of generating functions, see Wilf [23].) Thus we have, for example,

• 0(x) = 0/0! x⁰ + 0/1! x¹ + ··· = 0,
• 1(x) = 1,
• X(x) = x,
• L(x) = ∑_{n≥0} (n!/n!) xⁿ = 1/(1 − x),
• E(x) = ∑_{n≥0} (1/n!) xⁿ = eˣ,
• C(x) = ∑_{n≥1} ((n − 1)!/n!) xⁿ = −log(1 − x).

(Here is another good reason to call the species of sets E!) At first glance this may seem arbitrary, but quite the opposite is true: species sum, product, composition, and differentiation correspond precisely to the same operations on formal power series! For example, if F(x) and G(x) count the number of labeled F- and G-structures as defined above, it is easy to see that F(x) + G(x) counts the number of labeled (F + G)-structures in the same way, since every (F + G)-structure is either an F-structure or a G-structure (with a tag). And although it is not as immediately apparent, we can verify that F(x)G(x) = (F • G)(x) as well, writing C(n, k) for the binomial coefficient n!/(k!(n − k)!):

F(x)G(x) = (∑_{n≥0} fₙ xⁿ/n!) (∑_{n≥0} gₙ xⁿ/n!)
         = ∑_{n≥0} ∑_{k=0}^{n} (fₖ xᵏ/k!) (g_{n−k} x^{n−k}/(n − k)!)
         = ∑_{n≥0} ∑_{k=0}^{n} fₖ g_{n−k} xⁿ/(k!(n − k)!)
         = ∑_{n≥0} (∑_{k=0}^{n} C(n, k) fₖ g_{n−k}) xⁿ/n!

The expression in the outermost parentheses is precisely the number of labeled (F • G)-structures on a label set of size n: for each k from 0 to n, there are C(n, k) ways to pick k of the n labels to put in the F-structure, fₖ ways to create an F-structure from them, and g_{n−k} ways to create a G-structure from the remaining labels. The reader may also enjoy working out why species differentiation corresponds to exponential generating function differentiation.
156
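This product formula translates directly into code. Below is a minimal sketch in plain ASCII Haskell—not the library's EGF type, which appears shortly—representing an egf as an infinite lazy list of coefficients fₙ, with sum and product implemented exactly as derived above:

import Data.List (genericIndex)

-- an egf given by its coefficients f_0, f_1, f_2, ... of x^n/n!;
-- the coefficient lists are assumed infinite (pad with zeros otherwise)
newtype EGF = EGF [Integer]

addEGF :: EGF -> EGF -> EGF
addEGF (EGF fs) (EGF gs) = EGF (zipWith (+) fs gs)

-- the n-th coefficient is sum_{k=0..n} C(n,k) * f_k * g_{n-k}
mulEGF :: EGF -> EGF -> EGF
mulEGF (EGF fs) (EGF gs) =
  EGF [ sum [ binom n k * genericIndex fs k * genericIndex gs (n - k)
            | k <- [0 .. n] ]
      | n <- [0 ..] ]
  where binom n k = fact n `div` (fact k * fact (n - k))
        fact m    = product [1 .. m]

-- the species L of lists: n! labeled structures of each size
listEGF :: EGF
listEGF = EGF (scanl (*) 1 [1 ..])

For example, the first coefficients of mulEGF listEGF listEGF are [1, 2, 6, 24, ...], counting (L • L)-structures—the same (n + 1)! one-hole lists observed in the differentiation example.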
Seeing the correspondence between species composition and egf substitution takes more work; the interested reader should look up Faà di Bruno's formula. There are also generating function operations corresponding to Cartesian product and functor composition. Although they are not as natural as the other operations, they are simple to define and easy to compute. As a result, we can count labeled structures by interpreting species expressions as exponential generating functions, conveniently represented by infinite lazy lists of coefficients. In fact, this particular technique of counting labeled structures has been known in the functional programming community for some time [19, 21]. The species library defines an EGF type with an appropriate Species instance, and a labeled function to extract the coefficients from an egf. For example:

> take 10 . labeled $ 3 + x*x
[3,0,2,0,0,0,0,0,0,0]
> take 10 . labeled $ cycle ‘o‘ (nonEmpty linOrd)
[0,1,3,14,90,744,7560,91440,1285200,20603520]

Thus, there are no C ◦ L+ structures of size 0, one with a single label, three with two labels, 14 of size three, and so on. The library can even compute generating functions for some recursively defined species, using a quadratically converging combinatorial analogue of the Newton-Raphson method. For example, once Dorothy has used the species library to derive all the appropriate instances for her Family type via Template Haskell, she can use the labeled function to count them:

> take 10 . labeled $ family
[0,3,6,72,1368,36000,1213920,49956480,2427788160,136075645440]

To each species we can also associate an ordinary generating function (ogf) and a cycle index series; the first counts unlabeled structures (shapes), and the second is a generalization of both exponential and ordinary generating functions which also keeps track of symmetries. There is not space to describe them here, but more information can be found in Bergeron et al. [4] or in the documentation for the species library, which includes facilities for computing with all three types of generating functions.
8. Extensions and applications

It should come as no surprise that we have barely scratched the surface; the theory of combinatorial species is both rich and deep. In closing, we will look at some extensions to the theory discussed here which may lead to a deeper understanding of functional programming and data types, as well as potential applications of the theory. At the time of writing, the species library does not include support for any of these extensions, but their inclusion is planned for future releases.

8.1 Extensions

Weighted species  Assigning weights to the structures built by a species allows us to count and enumerate structures in much more refined ways. For example, we can define the species of binary trees, weighted by the number of leaves, and then easily count or enumerate only those trees with a certain number of leaves.

Multisort species  We have only considered species which map a single set of labels to a set of structures, corresponding to type constructors of a single argument. However, all of the theory generalizes straightforwardly to multisort species, which build structures from multiple sets of labels (sorts). For example, multisort species are exactly what we need to make the implicit species theorem precise. First, we can write an implicit equation for F in the form F = H(X, F), where H is a two-sort species. For example, if H(X, Y) = 1 + X • Y then L = H(X, L) is the implicit equation defining the species of lists. The necessary conditions for the implicit species theorem to apply can now be stated precisely:

• H(0, 0) = 0 (no structures on the empty label set)
• (∂H/∂Y)(0, 0) = 0 (F does not trivially reduce to itself)

Extending the species library to handle general multisort species presents an interesting challenge, due to Haskell's lack of kind polymorphism, and is a topic for further research.

Virtual species  It is possible to complete the semiring of species to a ring in a way analogous to the set-theoretic completion of the natural numbers to the integers. We consider pairs of species (F, G) where F is considered "positive" and G "negative"; more precisely, we define an equivalence relation on pairs of species such that (F, G) ∼ (H, K) if and only if F + K is isomorphic to G + H, and define virtual species as the equivalence classes of this relation. Virtual species allow us to define a multiplicative inverse for the species E of sets, and from there to define multiplicative and compositional inverses for other suitable species, solve differential equations, and define a combinatorial logarithm which generalizes the notion of structures built from connected components. Virtual species allow us to give a sensible and consistent meaning to equations like L = 1/(1 − X).

Molecular species

Definition 6. A species F is molecular if all F-structures are isomorphic (i.e. relabelings of one another).

For example, the species X² of ordered pairs is molecular, since we can go from any ordered pair to any other by relabeling. On the other hand, the species L of linear orderings is not molecular, since any two list structures of different lengths are fundamentally non-isomorphic. We have the following three beautiful facts:

• The molecular species are precisely those that cannot be decomposed as the sum of two nonzero species.

• Every molecular species is equivalent to Xⁿ "quotiented by some symmetries"; in particular, the molecular species of size n are in one-to-one correspondence with the conjugacy classes of subgroups of Sₙ. This gives us a way to completely classify molecular species and to compute with them directly. For example, there are four conjugacy classes of subgroups of S₃, each representing a different symmetry on three locations: the trivial subgroup corresponds to X³ itself (no symmetry); swapping two locations yields X • E₂; cycling the locations yields C₃; and identifying all the locations yields E₃.

• Every species can be written uniquely (up to isomorphism and reordering of terms) as a sum of molecular species. This, combined with the previous fact, immediately gives us a complete classification of all combinatorial species. It also provides a method for finding canonical representatives of virtual species: given a pair (F, G), decompose each into a sum of molecular species and cancel any that occur in both F and G. As a corollary, we can always detect when a species that "looks" virtual is actually non-virtual, such as L − 1.
Now we see why species with no nontrivial symmetries can always be built from 1, X, +, and •: any species with no symmetries must be isomorphic to a sum of molecular species with no symmetries; but molecular species with no symmetries must be of the form Xⁿ. Hence regular species are always of the form n₀ + n₁X + n₂X² + ... with nᵢ ∈ ℕ. Adding a fixed point operator allows us to write down certain infinite such sums using only finite expressions.

8.2 Applications

Automated testing  One interesting application is to use species expressions as input to a test-generator-generator, for either random [9] or exhaustive [22] testing. In fact, Canou and Darrasse [7] have already created a library for random test generation in OCaml based on the ideas of combinatorial species. There has also been some interesting recent work by Duregård on automatic derivation of QuickCheck generators for algebraic data types [11], and by Bernardy et al. on using parametricity to improve random test generation [5]; combining these approaches with insights from the theory of species seems promising.

Language design  What if we had a programming language that actually allowed us to declare non-regular data types? What would such a language look like? Could it be made practical? Carette and Uszkay [8] have explored this question by creating a Haskell library allowing the user to program with species. Abbott et al. have explored a similar question from a more theoretical point of view, with their more general notion of quotient containers [3]. More work needs to be done to explain the precise relationship between containers and species, and to transfer these approaches into practical technology available to programmers.

Acknowledgments

I would like to thank the anonymous reviewers for many detailed and helpful comments, and Jeremy Gibbons for the initial encouragement a year ago to write this paper. This work was partially supported by the National Science Foundation, under grant 0910786 TRELLYS.

References

[1] Michael Abbott, Thorsten Altenkirch, and Neil Ghani. Categories of containers. In Foundations of Software Science and Computation Structures, pages 23–38, 2003.
[2] Michael Abbott, Thorsten Altenkirch, Neil Ghani, and Conor McBride. Derivatives of containers. In Typed Lambda Calculi and Applications, TLCA, volume 2701 of LNCS. Springer-Verlag, 2003.
[3] Michael Abbott, Thorsten Altenkirch, Neil Ghani, and Conor McBride. Constructing polymorphic programs with quotient types. In 7th International Conference on Mathematics of Program Construction (MPC 2004), volume 3125 of LNCS. Springer-Verlag, 2004.
[4] F. Bergeron, G. Labelle, and P. Leroux. Combinatorial Species and Tree-like Structures. Number 67 in Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 1998.
[5] Jean-Philippe Bernardy, Patrik Jansson, and Koen Claessen. Testing polymorphic properties. In ESOP 2010: Proceedings of the 19th European Symposium on Programming, pages 125–144, London, UK, 2010. Springer-Verlag.
[6] Bird and Meertens. Nested datatypes. In MPC: 4th International Conference on Mathematics of Program Construction, LNCS. Springer-Verlag, 1998.
[7] Benjamin Canou and Alexis Darrasse. Fast and sound random generation for automated testing and benchmarking in Objective Caml. In ML '09: Proceedings of the 2009 ACM SIGPLAN workshop on ML, pages 61–70, New York, NY, USA, 2009. ACM.
[8] Jacques Carette and Gordon Uszkay. Species: making analytic functors practical for functional programming. Available at http://www.cas.mcmaster.ca/~carette/species/, 2008.
[9] Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In ICFP '00: Proceedings of the fifth ACM SIGPLAN international conference on Functional programming, pages 268–279, New York, NY, USA, 2000. ACM.
[10] Duncan Coutts, Isaac Potoczny-Jones, and Don Stewart. Haskell: batteries included. In Haskell '08: Proceedings of the first ACM SIGPLAN symposium on Haskell, pages 125–126, New York, NY, USA, 2008. ACM.
[11] Jonas Almström Duregård. AGATA: Random generation of test data. Master's thesis, Chalmers University of Technology, December 2009.
[12] P. Flajolet, B. Salvy, and P. Zimmermann. Lambda-upsilon-omega: The 1989 cookbook. Technical Report 1073, Institut National de Recherche en Informatique et en Automatique, August 1989. 116 pages.
[13] Philippe Flajolet and Bruno Salvy. Computer algebra libraries for combinatorial structures. Journal of Symbolic Computation, 20(5-6):653–671, 1995.
[14] Gérard Huet. Functional pearl: The zipper. Journal of Functional Programming, 7(5):549–554, 1997.
[15] C. Barry Jay and J. Robin B. Cockett. Shapely types and shape polymorphism. In ESOP '94: Proceedings of the 5th European Symposium on Programming, pages 302–316, London, UK, 1994. Springer-Verlag.
[16] André Joyal. Une théorie combinatoire des séries formelles. Advances in Mathematics, 42(1):1–82, 1981.
[17] Conor McBride. The derivative of a regular type is its type of one-hole contexts. Available at http://www.cs.nott.ac.uk/~ctm/diff.ps.gz, 2001.
[18] Conor McBride. Clowns to the left of me, jokers to the right (pearl): dissecting data structures. In Proceedings of the 35th annual ACM SIGPLAN-SIGACT symposium on Principles of Programming Languages, pages 287–295, San Francisco, California, USA, 2008. ACM.
[19] M. Douglas McIlroy. Power series, power serious. Journal of Functional Programming, 9(3):325–337, 1999.
[20] Peter Morris, Thorsten Altenkirch, and Conor McBride. Exploring the regular tree types. 2004.
[21] Dan Piponi. A small combinatorial library, November 2007. http://blog.sigfpe.com/2007/11/small-combinatorial-library.html.
[22] Colin Runciman, Matthew Naylor, and Fredrik Lindblad. SmallCheck and Lazy SmallCheck: automatic exhaustive testing for small values. In Haskell '08: Proceedings of the first ACM SIGPLAN symposium on Haskell, pages 37–48, New York, NY, USA, 2008. ACM.
[23] Herbert S. Wilf. Generatingfunctionology. Academic Press, 1990.