PROCEEDINGS OF THE SIXTH WORKSHOP ON ALGORITHM ENGINEERING AND EXPERIMENTS AND THE FIRST WORKSHOP ON ANALYTIC ALGORITHMICS AND COMBINATORICS
Edited by Lars Arge, Giuseppe F. Italiano, and Robert Sedgewick
Society for Industrial and Applied Mathematics Philadelphia
Proceedings of the Sixth Workshop on Algorithm Engineering and Experiments, New Orleans, LA, January 10, 2004
Proceedings of the First Workshop on Analytic Algorithmics and Combinatorics, New Orleans, LA, January 10, 2004
The workshops were supported by the ACM Special Interest Group on Algorithms and Computation Theory and the Society for Industrial and Applied Mathematics.
Copyright © 2004 by the Society for Industrial and Applied Mathematics.
All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.
Library of Congress Catalog Card Number: 2004104581
ISBN 0-89871-564-4
SIAM is a registered trademark.
CONTENTS

Preface to the Workshop on Algorithm Engineering and Experiments
Preface to the Workshop on Analytic Algorithmics and Combinatorics

Workshop on Algorithm Engineering and Experiments

Engineering Geometric Algorithms: Persistent Problems and Some Solutions (Abstract of Invited Talk)
    Dan Halperin
Engineering a Cache-Oblivious Sorting Algorithm
    Gerth Stølting Brodal, Rolf Fagerberg, and Kristoffer Vinther
The Robustness of the Sum-of-Squares Algorithm for Bin Packing
    Michael A. Bender, Bryan Bradley, Geetha Jagannathan, and Krishnan Pillaipakkamnatt
Practical Aspects of Compressed Suffix Arrays and FM-Index in Searching DNA Sequences
    Wing-Kai Hon, Tak-Wah Lam, Wing-Kin Sung, Wai-Leuk Tse, Chi-Kwong Wong, and Siu-Ming Yiu
Faster Placement of Hydrogens in Protein Structures by Dynamic Programming
    Andrew Leaver-Fay, Yuanxin Liu, and Jack Snoeyink
An Experimental Analysis of a Compact Graph Representation
    Daniel K. Blandford, Guy E. Blelloch, and Ian A. Kash
Kernelization Algorithms for the Vertex Cover Problem: Theory and Experiments
    Faisal N. Abu-Khzam, Rebecca L. Collins, Michael R. Fellows, Michael A. Langston, W. Henry Suters, and Christopher T. Symons
Safe Separators for Treewidth
    Hans L. Bodlaender and Arie M.C.A. Koster
Efficient Implementation of a Hotlink Assignment Algorithm for Web Sites
    Artur Alves Pessoa, Eduardo Sany Laber, and Criston de Souza
Experimental Comparison of Shortest Path Approaches for Timetable Information
    Evangelia Pyrga, Frank Schulz, Dorothea Wagner, and Christos Zaroliagis
Reach-Based Routing: A New Approach to Shortest Path Algorithms Optimized for Road Networks
    Ron Gutman
Lazy Algorithms for Dynamic Closest Pair with Arbitrary Distance Measures
    Jean Cardinal and David Eppstein
Approximating the Visible Region of a Point on a Terrain
    Boaz Ben-Moshe, Paz Carmi, and Matthew J. Katz
A Computational Framework for Handling Motion
    Leonidas Guibas, Menelaos I. Karavelas, and Daniel Russel
Engineering a Sorted List Data Structure for 32 Bit Keys
    Roman Dementiev, Lutz Kettner, Jens Mehnert, and Peter Sanders

Workshop on Analytic Algorithmics and Combinatorics

Theory and Practice of Probabilistic Counting Algorithms (Abstract of Invited Talk)
    Philippe Flajolet
Analysis of a Randomized Selection Algorithm Motivated by the LZ'77 Scheme
    Mark Daniel Ward and Wojciech Szpankowski
The Complexity of Jensen's Algorithm for Counting Polyominoes
    Gill Barequet and Micha Moffie
Distributional Analyses of Euclidean Algorithms
    Viviane Baladi and Brigitte Vallée
A Simple Primality Test and the rth Smallest Prime Factor
    Daniel Panario, Bruce Richmond, and Martha Yip
Gap-Free Samples of Geometric Random Variables
    Paweł Hitczenko and Arnold Knopfmacher
Computation of a Class of Continued Fraction Constants
    Loïck Lhote
Compositions and Patricia Tries: No Fluctuations in the Variance!
    Helmut Prodinger
Quadratic Convergence for Scaling of Matrices
    Martin Fürer
Partial Quicksort
    Conrado Martínez

Author Index
ALENEX WORKSHOP PREFACE The annual workshop on Algorithm Engineering and Experiments (ALENEX) provides a forum for the presentation of original research in the implementation and experimental evaluation of algorithms and data structures. ALENEX 2004 was the sixth workshop in this series. It was held in New Orleans, Louisiana, on January 10, 2004. These proceedings contain the 14 papers that were selected for presentation from a total of 56 submissions. Considerable effort was devoted to the evaluation of the submissions. However, submissions were not refereed in the thorough and detailed way that is customary for journal papers. It is expected that most of the papers in these proceedings will eventually appear in finished form in scientific journals.
We would like to thank all the people who contributed to a successful workshop. In particular, we thank the Program Committee and all of our many colleagues who helped the Program Committee evaluate the submissions. We also thank Adam Buchsbaum for answering our many questions along the way and Bryan Holland-Minkley for help with the submission and program committee software. We gratefully acknowledge the generous support of Microsoft, which helped reduce the registration fees for students, and thank SIAM for providing in-kind support and facilitating the workshop; in particular, the help of Kirsten Wilden from SIAM has been invaluable. Finally, we would like to thank the invited speaker, Dan Halperin of Tel Aviv University.
Lars Arge and Giuseppe F. Italiano
ALENEX 2004 Program Committee
Lars Arge, Duke University
Jon Bentley, Avaya Labs Research
Mark de Berg, Technische Universiteit Eindhoven
Monika Henzinger, Google
Giuseppe F. Italiano, University of Rome
David Karger, Massachusetts Institute of Technology
Ulrich Meyer, Max-Planck-Institut für Informatik
Jan Vahrenhold, University of Münster

ALENEX 2004 Steering Committee

Adam Buchsbaum, AT&T Research
Roberto Battiti, University of Trento
Andrew V. Goldberg, Microsoft Research
Michael Goodrich, University of California, Irvine
David S. Johnson, AT&T Research
Richard E. Ladner, University of Washington, Seattle
Catherine C. McGeoch, Amherst College
David Mount, University of Maryland, College Park
Bernard M.E. Moret, University of New Mexico
Jack Snoeyink, University of North Carolina, Chapel Hill
Clifford Stein, Columbia University
ALENEX 2004 Subreviewers
Pankaj K. Agarwal Susanne Albers Ernst Althaus Rene Beier Michael Bender Henrik Blunck Christian Breimann Herve Bronnimann David Bryant Adam Buchsbaum Stefan Burkhardt Marco Cesati Sunil Chandran Erik Demaine Roman Dementiev Camil Demetrescu Jeff Erickson Irene Finocchi Luisa Gargano Leszek Gasieniec Raffaele Giancarlo Roberto Grossi Concettina Guerra Bryan Holland-Minkley Michael Jacob
Klaus Jansen Juha Kaerkkainen Spyros Kontogiannis Luigi Laura Giovanni Manzini Eli Packer Ron Parr Marco Pellegrini Seth Pettie Jordi Planes Naila Rahman Rajeev Raman Joachim Reichel Timo Ropinski Kay Salzwedel Peter Sanders Guido Schaefer Christian Scheideler Naveen Sivadasan Martin Skutella Bettina Speckmann Venkatesh Srinivasan Firas Swidan Kavitha Telikepalli Norbert Zeh
ANALCO WORKSHOP PREFACE The papers in these proceedings, along with the invited talk by Philippe Flajolet, "Theory and Practice of Probabilistic Counting Algorithms," were presented at the First Workshop on Analytic Algorithmics and Combinatorics (ANALCO04), which was held in New Orleans, Louisiana, on January 10, 2004. The aim of ANALCO is to provide a forum for the presentation of original research in the analysis of algorithms and associated combinatorial structures. The papers study properties of fundamental combinatorial structures that arise in practical computational applications (such as permutations, trees, strings, tries, and graphs) and address the precise analysis of algorithms for processing such structures, including average-case analysis; analysis of moments, extrema, and distributions; and probabilistic analysis of randomized algorithms. Some of the papers present significant new information about classic algorithms; others present analyses of new algorithms that pose unique analytic challenges, or address tools and techniques for the analysis of algorithms and combinatorial structures, both mathematical and computational.
The workshop took place on the same day as the Sixth Workshop on Algorithm Engineering and Experiments (ALENEX04); the papers from that workshop are also published in this volume. Since researchers in both fields are approaching the problem of learning detailed information about the performance of particular algorithms, we expect that interesting synergies will develop. People in the ANALCO community are encouraged to look over the ALENEX papers for problems where the analysis of algorithms might play a role; people in the ALENEX community are encouraged to look over these ANALCO papers for problems where experimentation might play a role.
Program Committee

Kevin Compton, University of Michigan
Luc Devroye, McGill University, Canada
Mordecai Golin, The Hong Kong University of Science and Technology, Hong Kong
Hsien-Kuei Hwang, Academia Sinica, Taiwan
Robert Sedgewick (Chair), Princeton University
Wojciech Szpankowski, Purdue University
Brigitte Vallée, Université de Caen, France
Jeffrey S. Vitter, Purdue University
Workshop on Algorithm Engineering and Experiments
Invited Plenary Speaker Abstract

Engineering Geometric Algorithms: Persistent Problems and Some Solutions
Dan Halperin, School of Computer Science, Tel Aviv University

The last decade has seen growing attention to the engineering of geometric algorithms, in particular around the development of large-scale software for computational geometry (like CGAL and LEDA). Besides standard issues such as efficiency, the developer of geometric software has to tackle the hardship of robustness problems, namely problems related to arithmetic precision and degenerate input, which are typically ignored in the theory of geometric algorithms and which, in spite of considerable efforts, are still unresolved in full (practical) generality. We start with an overview of these persistent robustness problems, together with a brief review of prevailing solutions to them. We also briefly describe the CGAL project and library. We then focus on fixed precision approximation methods to deal with robustness issues, and in particular on so-called controlled perturbation, which leads to robust implementation of geometric algorithms while using the standard machine floating-point arithmetic. We conclude with algorithm-engineering matters that are still geometric but have the more general flavor of addressing efficiency: (i) We discuss the fundamental issue of geometric decomposition (that is, decomposing geometric structures into simpler substructures), exemplify the large gap between the theory and practice of such decompositions, and present practical solutions in two and three dimensions. (ii) We suggest a hybrid approach to motion planning that significantly improves simple heuristic methods by integrating exact geometric algorithms to solve subtasks. The new results that we describe are documented in: http://www.cs.tau.ac.il/CGAL/Projects/
Engineering a Cache-Oblivious Sorting Algorithm*

Gerth Stølting Brodal†,‡    Rolf Fagerberg†    Kristoffer Vinther§

*This work is based on the M.Sc. thesis of the third author [29].
†BRICS (Basic Research in Computer Science, www.brics.dk, funded by the Danish National Research Foundation), Department of Computer Science, University of Aarhus, DK-8000 Århus C, Denmark. E-mail: {gerth,rolf}@brics.dk. Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT).
‡Supported by the Carlsberg Foundation (contract number ANS-0257/20).
§Systematic Software Engineering A/S, Søren Frichs Vej 39, DK-8000 Århus C, Denmark. E-mail: kv@brics.dk.

Abstract
This paper is an algorithmic engineering study of cache-oblivious sorting. We investigate a number of implementation issues and parameter choices for the cache-oblivious sorting algorithm Lazy Funnelsort by empirical methods, and compare the final algorithm with Quicksort, the established standard for comparison based sorting, as well as with recent cache-aware proposals. The main result is a carefully implemented cache-oblivious sorting algorithm, which our experiments show can be faster than the best Quicksort implementation we can find, already for input sizes well within the limits of RAM. It is also at least as fast as the recent cache-aware implementations included in the test. On disk the difference is even more pronounced regarding Quicksort and the cache-aware algorithms, whereas the algorithm is slower than a careful implementation of multiway Mergesort such as TPIE.

1 Introduction
Modern computers contain a hierarchy of memory levels, with each level acting as a cache for the next. Typical components of the memory hierarchy are: registers, level 1 cache, level 2 cache, level 3 cache, main memory, and disk. The time for accessing a level increases for each new level (most dramatically when going from main memory to disk), making the cost of a memory access depend highly on what is the current lowest memory level containing the element accessed. As a consequence, the memory access pattern of an algorithm has a major influence on its running time in practice. Since classic asymptotic analysis of algorithms in the RAM model is unable to capture this,
a number of more elaborate models for analysis have been proposed. The most widely used of these is the I/O model introduced by Aggarwal and Vitter [2] in 1988, which assumes a memory hierarchy containing two levels, the lower level having size M and the transfer between the two levels taking place in blocks of B consecutive elements. The cost of the computation is the number of blocks transferred. The strength of the I/O model is that it captures part of the memory hierarchy, while being sufficiently simple to make analysis of algorithms feasible. In particular, it adequately models the situation where the memory transfer between two levels of the memory hierarchy dominates the running time, which is often the case when the size of the data significantly exceeds the size of main memory. By now, a large number of results for the I/O model exists—see the surveys by Arge [3] and Vitter [30]. Among the fundamental facts are that in the I/O model, comparison based sorting takes Θ(Sort_{M,B}(N)) I/Os in the worst case, where Sort_{M,B}(N) = (N/B)·log_{M/B}(N/B).
More elaborate models for multi-level memory hierarchies have been proposed ([30, Section 2.3] gives an overview), but fewer analyses of algorithms have been done. For these models, as for the I/O model of Aggarwal and Vitter, algorithms are assumed to know the characteristics of the memory hierarchy.
Recently, the concept of cache-oblivious algorithms was introduced by Frigo et al. [19]. In essence, this designates algorithms formulated in the RAM model, but analyzed in the I/O model for arbitrary block size B and memory size M. I/Os are assumed to be performed automatically by an offline optimal cache replacement strategy. This seemingly simple change has significant consequences: since the analysis holds for any block and memory size, it holds for all levels of a multilevel memory hierarchy (see [19] for details). In other words, by optimizing an algorithm to one unknown level of the memory hierarchy, it is optimized to each level automatically. Thus, the cache-oblivious model in an elegant way combines the simplicity of the I/O model with a coverage of the entire memory hierarchy. An additional benefit is that the characteristics of the memory hierarchy do not need to be known, and do not need to be hardwired into the algorithm for
the analysis to hold. This increases the algorithm's portability (a benefit for e.g. software libraries), and its robustness against changing memory resources on machines running multiple processes.
In 1999, Frigo et al. introduced the concept of cache-obliviousness, and presented optimal cache-oblivious algorithms for matrix transposition, FFT, and sorting [19], and also gave a proposal for static search trees [25] with search cost matching that of standard (cache-aware) B-trees [6]. Since then, quite a number of results for the model have appeared, including the following: Bender et al. [11] gave a proposal for cache-oblivious dynamic search trees with search cost matching B-trees. Simpler cache-oblivious search trees with complexities matching that of [11] were presented in [12, 17, 26], and a variant with worst case bounds for updates appears in [8]. Cache-oblivious algorithms have been given for problems in computational geometry [1, 8, 14], for scanning dynamic sets [7], for layout of static trees [9], and for partial persistence [8]. Cache-oblivious priority queues have been developed in [4, 15], which in turn gives rise to several cache-oblivious graph algorithms [4].
Some of these results, in particular those involving sorting and algorithms to which sorting reduces, such as priority queues, are proved under the assumption M ≥ B², which is also known as the tall cache assumption. In particular, this applies to the Funnelsort algorithm of Frigo et al. [19]. A variant termed Lazy Funnelsort [14] works under the weaker tall cache assumption M ≥ B^{1+ε} for any fixed ε > 0, at the cost of a 1/ε factor compared to the optimal sorting bound Θ(Sort_{M,B}(N)) for the case M ≥ B^{1+ε}.
Recently, it was shown [16] that a tall cache assumption is necessary for cache-oblivious comparison based sorting algorithms, in the sense that the trade-off attained by Lazy Funnelsort between strength of assumption and cost for the case M ≥ B^{1+ε} is best possible. This demonstrates a separation in power between the I/O model and the cache-oblivious model for the problem of comparison based sorting. Separations have also been shown for the problems of permuting [16] and of comparison based searching [10].
In contrast to the abundance of theoretical results described above, empirical evaluations of the merits of cache-obliviousness are more scarce. Existing results have focused on basic matrix algorithms [19] and search trees [17, 23, 26]. Although a bit tentative, they conclude that in these areas, the efficiency of cache-oblivious algorithms lies between that of classic RAM algorithms and that of algorithms exploiting knowledge about the specific memory hierarchy present (often termed cache-aware algorithms).
In this paper, we investigate the practical value of cache-oblivious methods in the area of sorting. We focus on the Lazy Funnelsort algorithm, since we believe it to have the biggest potential for an efficient implementation among the current proposals for I/O-optimal cache-oblivious sorting algorithms. We explore a number of implementation issues and parameter choices for the cache-oblivious sorting algorithm Lazy Funnelsort, and settle the best choices through experiments. We then compare the final algorithm with tuned versions of Quicksort, which is generally acknowledged to be the fastest all-round comparison based sorting algorithm, as well as with recent cache-aware proposals. Note that the I/O cost of Quicksort is Θ((N/B)·log₂(N/M)), which only differs from the optimal bound Sort_{M,B}(N) by the base of the logarithm.
The main result is a carefully implemented cache-oblivious sorting algorithm, which our experiments show can be faster than the best Quicksort implementation we can find, already for input sizes well within the limits of RAM. It is also at least as fast as the recent cache-aware implementations included in the test. On disk the difference is even more pronounced regarding Quicksort and the cache-aware algorithms, whereas the algorithm is slower than a careful implementation of multiway Mergesort such as TPIE [18].
These findings support—and extend to the area of sorting—the conclusion of the previous empirical results on cache-obliviousness. This conclusion is that cache-oblivious methods can lead to actual performance gains over classic algorithms developed in the RAM model. The gains may not always match those of the best algorithm tuned to a specific memory hierarchy level, but on the other hand appear to be more robust, applying to several memory hierarchy levels simultaneously.
One observation of independent interest made in this paper is that for the main building block of Funnelsort, namely the k-merger, there is no need for a specific memory layout (contrary to its previous descriptions [14, 19]) for its analysis to hold. Thus, the central feature of the k-merger definition is the sizes of its buffers; its layout in memory is not part of the definition.
The rest of this paper is organized as follows: In Section 2, we describe Lazy Funnelsort. In Section 3, we describe our experimental setup. In Section 4, we develop our optimized implementation of Funnelsort, and in Section 5, we compare it experimentally to a collection of existing efficient sorting algorithms. In Section 6, we sum up our findings.
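For reference, the two I/O bounds that the comparison in this paper hinges on can be written out side by side. The following block is only a LaTeX restatement of the bounds cited above from [2, 19], with N the input size, M the memory size, and B the block size.

\[
  \mathrm{Sort}_{M,B}(N) \;=\; \frac{N}{B}\,\log_{M/B}\frac{N}{B}
  \qquad \text{(optimal comparison based sorting in the I/O model)}
\]
\[
  \Theta\!\left(\frac{N}{B}\,\log_{2}\frac{N}{M}\right)
  \qquad \text{(I/O cost of Quicksort; worse only in the base of the logarithm)}
\]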
2 Funnelsort
Three algorithms for cache-oblivious sorting have been proposed so far: Funnelsort [19], its variant Lazy Funnelsort [14], and a distribution based algorithm [19]. These all have the same optimal bound of Θ(Sort_{M,B}(N)) on the number of I/Os performed, but have rather different structural complexity, with Lazy Funnelsort being the simplest. As simplicity of description often translates into smaller and more efficient code (for algorithms of the same asymptotic complexity), we find the Lazy Funnelsort algorithm the most promising with respect to practical efficiency. In this paper, we choose it as the basis for our study of the practical feasibility of cache-oblivious sorting. We now review the algorithm briefly, and give an observation which further simplifies it. For the full details, see [14].
The algorithm is based on binary mergers. A binary merger takes as input two sorted streams of elements and delivers as output the sorted stream formed by merging these. One merge step moves an element from the head of one of the input streams to the tail of the output stream. The heads of the input streams and the tail of the output stream reside in buffers holding a limited number of elements. A buffer is simply an array of elements, plus fields storing the capacity of the buffer and pointers to the first and last elements in the buffer. Binary mergers can be combined into binary merge trees by letting the output buffer of one merger be an input buffer of another—in other words, binary merge trees are binary trees with mergers at the nodes and buffers at the edges. The leaves of the tree contain the streams to be merged.
An invocation of a merger is a recursive procedure which performs merge steps until its output buffer is full or both input streams are exhausted. If during the invocation an input buffer gets empty, but the corresponding stream is not exhausted, the input buffer is recursively filled by an invocation of the merger having this buffer as its output buffer. If both input streams of a merger get exhausted, the corresponding output stream is marked as exhausted. A single invocation of the root of the merge tree will merge the streams at the leaves of the tree.
One particular merge tree is the k-merger. For k a power of two, a k-merger is a perfect binary tree of k − 1 binary mergers with appropriately sized buffers on the edges, k input streams, and an output buffer at the root of size k^d, for a parameter d > 1. A 16-merger is illustrated in Figure 1.

Figure 1: A 16-merger consisting of 15 binary mergers. Shaded regions are the occupied parts of the buffers.

The sizes of the buffers are defined recursively: Let the top tree be the subtree consisting of all nodes of depth at most ⌈i/2⌉, where i is the depth of the k-merger, and let the subtrees rooted by nodes at depth ⌈i/2⌉ + 1 be the bottom trees. The edges between nodes at depth ⌈i/2⌉ and depth ⌈i/2⌉ + 1 have associated buffers of size ⌈α·k^{d/2}⌉, where α is a positive parameter (introduced in this paper for tuning purposes), and the sizes of the remaining buffers are defined by recursion on the top tree and the bottom trees.
In the descriptions in [14, 19], a k-merger is also laid out recursively in memory (according to the so-called van Emde Boas layout [25]), in order to achieve I/O efficiency. We observe in this paper that this is not necessary: In the proof of Lemma 1 in [14], the central idea is to follow the recursive definition down to a specific size k of trees, and then consider the number of I/Os for loading this k-merger and one block for each of its output streams into memory. However, this price is not (except for constant factors) changed if we for each of the k − 1 nodes have to load one entire block holding the node, and one block for each of the input and output buffers of the node. From this it follows that the proof holds true, no matter how the k-merger is laid out. (However, the entire k-merger should occupy a contiguous segment of memory in order for the complexity proof of Funnelsort itself, Theorem 2 in [14], to be valid.) Hence, the crux of the definition of the k-merger lies entirely in the definition of the sizes of the buffers, and does not include the van Emde Boas layout.
To actually sort N elements, the algorithm recursively sorts N^{1/d} segments of size N^{1−1/d} of the input and then merges these using an N^{1/d}-merger. For a proof that this is an I/O optimal algorithm, see [14, 19].
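To make the description above concrete, the following is a minimal C++ sketch of buffers, binary merger nodes, and the recursive invocation. It is our own illustration of the mechanism described in this section, not the authors' implementation; the names, the fixed int element type, and the circular-buffer details are assumptions made for the sketch.

#include <cstddef>
#include <vector>

// A buffer: an array of elements plus its capacity and the positions of the
// first and last elements (kept as ever-growing indices into a circular array).
struct Buffer {
    std::vector<int> data;
    std::size_t head = 0, tail = 0, cap;
    explicit Buffer(std::size_t c) : data(c), cap(c) {}
    bool empty() const { return head == tail; }
    bool full()  const { return tail - head == cap; }
    int  front() const { return data[head % cap]; }
    void pop()         { ++head; }
    void push(int x)   { data[tail % cap] = x; ++tail; }
};

// A binary merger node: two input buffers, one output buffer, the child
// mergers that refill the inputs (null at the leaves), and exhaustion flags.
struct Merger {
    Buffer *in[2]     = {nullptr, nullptr};
    Buffer *out       = nullptr;
    Merger *child[2]  = {nullptr, nullptr};
    bool exhausted[2] = {false, false};
};

// Invocation: perform merge steps until the output buffer is full or both
// input streams are exhausted; an empty (non-exhausted) input buffer is
// refilled by recursively invoking the merger that has it as output buffer.
void invoke(Merger *m) {
    while (!m->out->full()) {
        for (int s = 0; s < 2; ++s)
            if (m->in[s]->empty() && !m->exhausted[s]) {
                if (m->child[s]) invoke(m->child[s]);
                if (m->in[s]->empty()) m->exhausted[s] = true;  // stream used up
            }
        bool e0 = m->in[0]->empty(), e1 = m->in[1]->empty();
        if (e0 && e1) return;                          // output stream exhausted
        int s = e0 ? 1 : e1 ? 0                        // only one side left, or
              : (m->in[0]->front() <= m->in[1]->front() ? 0 : 1);  // smaller head
        m->out->push(m->in[s]->front());
        m->in[s]->pop();
    }
}

A k-merger is then simply k − 1 such nodes connected by shared buffers, with the buffer capacities chosen as described above.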
                      Pentium 4        Pentium III      MIPS 10000    AMD Athlon       Itanium 2
Architecture type     Modern CISC      Classic CISC     RISC          Modern CISC      EPIC
Operating system      Linux v. 2.4.18  Linux v. 2.4.18  IRIX v. 6.5   Linux v. 2.4.18  Linux v. 2.4.18
Clock rate            2400 MHz         800 MHz          175 MHz       1333 MHz         1137 MHz
Address space         32 bit           32 bit           64 bit        32 bit           64 bit
Pipeline stages       20               12               6             10               8
L1 data cache size    8 KB             16 KB            32 KB         128 KB           32 KB
L1 line size          128 B            32 B             32 B          64 B             64 B
L1 associativity      4-way            4-way            2-way         2-way            4-way
L2 cache size         512 KB           256 KB           1024 KB       256 KB           256 KB
L2 line size          128 B            32 B             32 B          64 B             128 B
L2 associativity      8-way            4-way            2-way         8-way            8-way
TLB entries           128              64               64            40               128
TLB associativity     full             4-way            64-way        4-way            full
TLB miss handling     hardware         hardware         software      hardware         ?
RAM size              512 MB           256 MB           128 MB        512 MB           3072 MB

Table 1: The specifications of the machines used in this paper.
3 Methodology
As said, our goal is first to develop a good implementation of Funnelsort by finding good choices for design options and parameter values through empirical investigation, and then to compare its efficiency to that of Quicksort—the established standard for comparison based sorting algorithms—as well as that of recent cache-aware proposals. To ensure robustness of the conclusions, we perform all experiments on three rather different architectures, namely Pentium 4, Pentium III, and MIPS 10000. These are representatives of the modern CISC, the classic CISC, and the RISC type of computer architecture, respectively. In the final comparison of algorithms, we add the AMD Athlon (a modern CISC architecture) and the Intel Itanium 2 (denoted an EPIC architecture by Intel, for Explicit Parallel Instruction-set Computing) for even larger coverage. The specifications of all five machines used can be seen in Table 1. (Additionally, the Itanium 2 machine has 3072 KB of L3 cache, which is 12-way associative and has a cache line size of 128 B.)
Our programs are written in C++ and compiled by GCC v. 3.3.2 (Pentiums 4 and III, AMD Athlon), GCC v. 3.1.1 (MIPS 10000), or the Intel C++ compiler v. 7.0 (Itanium 2). We compile using maximal optimization.
We use three element types: integers, records containing one integer and one pointer, and records of 100 bytes. The first type is commonly used in experimental papers, but we do not find it particularly realistic, as keys normally have associated information. The second type models sorting small records directly, as well as key-sorting of large records. The third type models sorting medium sized records directly, and is the data type used in the Datamation Benchmark [20] originating from the database community.
We mainly consider uniformly distributed keys, but also try skewed inputs such as almost sorted data and data with few distinct key values, to ensure robustness of the conclusions. To keep the experiments during the engineering phase (Section 4) tolerable in number, we only use the second data type and the uniform distribution, believing that tuning based on these will transfer to other situations. We use the drand48 family of C library functions for generation of random values. Our performance metric is wall clock time, as measured by the gettimeofday C library function.
We keep the code for the different implementation options tested in the engineering phase as similar as possible, even though this generality entails some overhead. After judging what are the best choices of these options, we implement a clean version of the resulting algorithm, and use this in the final comparison against existing sorting algorithms.
Due to space limitations, we in this paper mainly sum up our findings, and show only a few plots of experimental data. A full set of plots (close to a hundred) can be found in [29]. Our code is available from http://www.daimi.au.dk/~kv/ALENEX04/.
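As an illustration of the setup just described, the sketch below shows one way to take wall clock time with gettimeofday and to generate uniformly distributed keys with the drand48/lrand48 family for the second element type (an integer plus a pointer). The Record type, the helper names, and the seed are assumptions of ours; this is not the authors' benchmark driver.

#include <sys/time.h>   // gettimeofday
#include <cstdlib>      // srand48, lrand48
#include <cstddef>
#include <vector>

// The second element type used in the paper: an integer key plus a pointer.
struct Record {
    long  key;
    void *info;
    bool operator<(const Record &r) const { return key < r.key; }
};

// Wall clock time in seconds, as measured by gettimeofday.
static double wall_time() {
    timeval tv;
    gettimeofday(&tv, nullptr);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// n records with uniformly distributed keys from the drand48 family.
static std::vector<Record> random_input(std::size_t n, long seed = 42) {
    srand48(seed);
    std::vector<Record> v(n);
    for (auto &r : v) { r.key = lrand48(); r.info = nullptr; }
    return v;
}

// Typical use: generate an input, time one run of a sorting routine.
//   std::vector<Record> in = random_input(1u << 24);
//   double t0 = wall_time();
//   my_sort(in.begin(), in.end());   // my_sort is a hypothetical routine under test
//   double seconds = wall_time() - t0;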
4 Engineering Lazy Funnelsort
We consider a number of design and parameter choices for our implementation of Lazy Funnelsort. We group them as indicated by the following subsections. To keep the number of experiments within realistic limits, we settle the choices one by one, in the order presented here. We test each particular question by experiments exercising only parts of the implementation, and/or by fixing the remaining choices at hopefully reasonable values while varying the parameter under investigation. In this section, we for reasons of space merely summarize
the results of each set of experiments—the actual plots can be found in [29]. Regarding notation: α and d are the parameters from the definition of the k-merger (see Section 2), and z denotes the degree of the basic mergers (see Section 4.3).
4.1 k-Merger Structure As noted in Section 2, no particular layout is needed for the analysis of Lazy Funnelsort to hold. However, some layout has to be chosen, and the choice could affect the running time. We consider BFS, DFS, and vEB layout. We also consider having a merger node stored along with its output buffer, or storing nodes and buffers separately (each part having the same layout).
The usual tree navigation method is by pointers. However, for the three layouts above, implicit navigation using arithmetic on node indices is possible—this is well-known for BFS [31], and arithmetic expressions for DFS and vEB layouts can be found in [17]. Implicit navigation saves space at the cost of more CPU cycles per navigation step. We consider both pointer based and implicit navigation.
We try two coding styles for the invocation of a merger, namely the straight-forward recursive implementation, and an iterative version. To control the forming of the layouts, we make our own allocation function, which starts by acquiring enough memory to hold the entire merger. We test the efficiency of our allocation function by also trying out the default allocator in C++. Using this, we have no guarantee that the proper memory layouts are formed, so we only try pointer based navigation in these cases.
Experiments: We test all combinations of the choices described above, except for a few infeasible ones (e.g. implicit navigation with the default allocator), giving a total of 28 experiments on each of the three machines. One experiment consists of merging k streams of k² elements in a k-merger with z = 2, α = 1, and d = 2. For each choice, we for values of k in [15; 270] measure the time for ⌈20,000,000/k³⌉ such mergings.
Results: The best combination on all architectures is recursive invocation of a pointer based vEB layout with nodes and buffers separate, allocated by the standard allocator. The time used for the slowest combination is up to 65% larger, and the difference is biggest on the Pentium 4 architecture. The largest gain occurs by choosing the recursive invocation over the iterative, and this gain is most pronounced on the Pentium 4 architecture, which also is the most sophisticated (it e.g. has a special return address stack holding the address of the next instruction to be fetched after returning from a function call, for its immediate execution). The vEB layout ensures around 10% reduction in time, which shows that the spatial locality of the layout is not entirely without influence in practice, despite its lack of influence on the asymptotic analysis. The implicit vEB layout is slower than its pointer based version, but less so on the Pentium 4 architecture, which also is the fastest of the processors and most likely the one least strained by complex arithmetic expressions.

4.2 Tuning the Basic Mergers The "inner loop" in the Lazy Funnelsort algorithm is the code performing the merge step in the nodes of the k-merger. We explore several ideas for efficient implementation of this code. One idea tested is to compute the minimum of the number of elements left in either input buffer and the space left in the output buffer. Merging can proceed for at least that many steps without checking the state of the buffers, thereby eliminating one branch from the core merging loop. We also try several hybrids of this idea and the basic merger.
This idea will not be a gain (rather, the minimum computation will constitute an overhead) in situations where one input buffer stays small for many merge steps. For this reason, we also implement the optimal merging algorithm of Hwang and Lin [21, 22], which has higher overhead, but is an asymptotical improvement when merging sorted lists of very different sizes. To counteract its overhead, we also try a hybrid solution which invokes it only when the contents of the input buffers are skewed in size.
Experiments: We run the same experiment as in Section 4.1. The values of α and d influence the sizes of the smallest buffers in the merger. These smallest buffers occur on every second level of the merger, so any node has one of these as either input or output buffer, making this size affect the heuristics above. For this reason, we repeat the experiment for (α, d) equal to (1, 3), (4, 2.5), and (16, 1.5). These have smallest buffer sizes of 8, 23, and 45, respectively.
Results: The Hwang-Lin algorithm has, as expected, a large overhead (a factor of three for the non-hybrid version). Somewhat to our surprise, the heuristic calculating minimum sizes is not competitive, being between 15% and 45% slower than the fastest (except on the MIPS 10000 architecture, where the differences between heuristics are less pronounced). Several hybrids fare better, but the straight-forward solution is consistently the winner in all experiments. We interpret this as the branch prediction of the CPUs being as efficient as explicit hand-coding for exploiting predictability in the branches in this code (all branches, except the result of the comparison of the heads of the input buffers, are rather predictable). Thus, hand-coding just constitutes overhead.
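A minimal sketch of the two inner-loop variants compared above, written on raw pointers so that it is self-contained: the straightforward merge step, and the variant that first computes how many steps can be taken without checking buffer state. The function names and the int element type are ours; this only illustrates the idea, not the authors' code.

#include <algorithm>   // std::min
#include <cstddef>

// Straightforward inner loop: every iteration checks all three buffer states.
// (This is the variant that won in the experiments above.)
inline void merge_plain(const int *&a, const int *a_end,
                        const int *&b, const int *b_end,
                        int *&out, int *out_end) {
    while (out != out_end && a != a_end && b != b_end)
        *out++ = (*a <= *b) ? *a++ : *b++;
}

// Heuristic variant: take the minimum of the elements left in either input
// and the space left in the output; that many steps need no state checks,
// which removes one branch from the core merging loop.
inline void merge_counted(const int *&a, const int *a_end,
                          const int *&b, const int *b_end,
                          int *&out, int *out_end) {
    while (out != out_end && a != a_end && b != b_end) {
        std::size_t steps = std::min<std::size_t>(
            std::min<std::size_t>(a_end - a, b_end - b), out_end - out);
        for (std::size_t i = 0; i < steps; ++i)
            *out++ = (*a <= *b) ? *a++ : *b++;
    }
}

As the results above indicate, the extra bookkeeping only pays if the removed checks are hard to predict, which on these CPUs they are not.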
4.3 Degree of Basic Mergers There is no need for the k-merger to be a binary tree. If we for instance base it on four-way basic mergers, we effectively remove every other level of the tree. This means less element movement and less tree navigation. In particular, a reduction in data movement seems promising—part of Quicksort's speed can be attributed to the fact that for random input, only about every other element is moved on each level in the recursion, whereas e.g. binary Mergesort moves all elements at each level. The price to pay is more CPU steps per merge step, and code complication due to the increase in the number of input buffers that can be exhausted.
Based on considerations of expected register use, element movements, and number of branches, we try several different ways of implementing multi-way mergers using sequential comparison of the front elements in the input buffers. We also try a heap-like approach using loser trees [22], which proved efficient in a previous study by Sanders [27] of priority queues in RAM. In total, seven proposals for multi-way mergers are implemented.
Experiments: We test the seven implementations in a 120-merger with (α, d) = (16, 2), and measure the time for eight mergings of 1,728,000 elements each. The test is run for degrees z = 2, 3, 4, ..., 9. For comparison, we also include the binary mergers from the last set of experiments.
Results: All implementations except the loser tree show the same behavior: As z goes from 2 to 9, the time first decreases, and then increases again, with the minimum attained around 4 or 5. The maximum is 40-65% slower than the fastest. Since the number of levels for elements to move through evolves as 1/log(z), while the number of comparisons per level evolves as z, a likely explanation is that there is an initial positive effect due to the decrease in element movements, which soon is overtaken by the increase in instruction count per level. The loser trees show only a decrease in running time for increasing z, consistent with the fact that the number of comparisons per element for a traversal of the merger is the same for all values of z, but the number of levels, and hence data movements, evolves as 1/log(z). Unfortunately, the running time starts out twice as large as for the remaining implementations for z = 2, and barely reaches them at z = 9. Apparently, the overhead is too large to make loser trees competitive in this setting. The plain binary mergers compete well, but are beaten by around 10% by the fastest four- or five-way mergers. All these findings are rather consistent across the three architectures.
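For illustration, here is a sketch of the simplest of the multi-way designs discussed above: one merge step of a z-way basic merger that finds the smallest front element by sequential comparison of the non-empty inputs. The Stream type and names are ours; the seven variants tested in the paper differ in how this step (and the refilling of empty inputs) is organized.

// A sorted input for one child of a z-way basic merger.
struct Stream { const int *cur; const int *end; };

// One merge step: scan the fronts of all non-empty inputs sequentially and
// move the smallest to the output.  Returns false when every input is empty.
inline bool multiway_step(Stream *in, int z, int *&out) {
    int best = -1;
    for (int s = 0; s < z; ++s) {
        if (in[s].cur == in[s].end) continue;               // empty input
        if (best < 0 || *in[s].cur < *in[best].cur) best = s;
    }
    if (best < 0) return false;
    *out++ = *in[best].cur++;
    return true;
}

Each step costs up to z − 1 comparisons, but elements pass through only about log_z k levels of the merge tree instead of log₂ k, which is exactly the trade-off discussed above.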
4.4 Merger Caching In the outer recursion of Funnelsort, the same size k-merger is used for all invocations on the same level of the recursion. A natural optimization would be to precompute these sizes and construct the needed k-mergers once for each size. These mergers are then reset each time they are used.
Experiments: We use the Lazy Funnelsort algorithm with (α, d, z) = (4, 2.5, 2), a straight-forward implementation of binary basic mergers, and a switch to std::sort, the STL implementation of Quicksort, for sizes below αz^d = 23. We sort instances ranging in size from 5,000,000 to 200,000,000 elements.
Results: On all architectures, merger caching gave a 3-5% speed-up.

4.5 Base Sorting Algorithm Like any recursive algorithm, the base case in Lazy Funnelsort is handled specially. As a natural limit, we require all k-mergers to have height at least two—this will remove a number of special cases in the code constructing the mergers. Therefore, for input sizes below αz^d we switch to another sorting algorithm. Experiments with the sorting algorithms Insertionsort, Selectionsort, Heapsort, Shellsort, and Quicksort (in the form of std::sort from the STL library) on input sizes from 10 to 100 revealed the expected result, namely that std::sort, which (in the GCC implementation) itself switches to Insertionsort below size 16, is the fastest for all sizes. We therefore choose this as the sorting algorithm for the base case.

4.6 Parameters α and d The final choices concern the parameters α (the factor in the buffer size expression) and d (the main parameter defining the progression of the recursion, in the outer recursion of Funnelsort as well as in the buffer sizes in the k-merger). These control the buffer sizes, and we investigate their impact on the running time.
Experiments: For values of d between 1.5 and 3 and for values of α between 1 and 40, we measure the running time for sorting inputs of various sizes in RAM.
Results: There is a marked rise in running time when α drops below 10, increasing to a factor of four for α = 1. This effect is particularly strong for d = 1.5. Smaller α and d give smaller buffer sizes, and the most likely explanation seems to be that the cost of navigating to and invoking a basic merger is amortized over fewer merge steps when the buffers are smaller. Other than that, the different values of d appear to behave quite similarly. A sensible choice appears to be α around 16, and d around 2.5.
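The sketch below ties Sections 4.4-4.6 together: the outer recursion with the base-case switch to std::sort below the α·z^d threshold, using α = 16, d = 2.5 as suggested above and z = 4. It is a runnable schematic of the control structure only; the k-merger (and its caching) is replaced by pairwise std::inplace_merge passes as a stand-in, so this is not the tuned cache-oblivious implementation.

#include <algorithm>   // std::sort, std::inplace_merge, std::min
#include <cmath>       // std::pow, std::ceil
#include <cstddef>

// Parameter values in the spirit of Section 4: alpha = 16, d = 2.5, and
// four-way basic mergers (z = 4).  Base-case threshold is alpha * z^d.
const double kAlpha = 16.0, kD = 2.5;
const int    kZ = 4;

void funnelsort(int *first, int *last) {
    std::size_t n = last - first;
    if ((double)n <= kAlpha * std::pow((double)kZ, kD)) {
        std::sort(first, last);          // base case, as settled in Section 4.5
        return;
    }
    // Outer recursion: k = ceil(n^(1/d)) segments of size about n^(1-1/d).
    std::size_t k   = (std::size_t)std::ceil(std::pow((double)n, 1.0 / kD));
    std::size_t seg = (n + k - 1) / k;
    for (std::size_t i = 0; i < n; i += seg)
        funnelsort(first + i, first + std::min(i + seg, n));
    // The real algorithm merges the k sorted segments with a (cached) k-merger;
    // as a runnable stand-in, this sketch merges them pairwise instead.
    for (std::size_t w = seg; w < n; w *= 2)
        for (std::size_t i = 0; i + w < n; i += 2 * w)
            std::inplace_merge(first + i, first + i + w,
                               first + std::min(i + 2 * w, n));
}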
5 Evaluating Lazy Funnelsort
In Section 4, we settled the best choices for a number of implementation issues for Lazy Funnelsort. In this section, we investigate the practical merits of the resulting algorithm. We implement two versions: Funnelsort2, which uses binary basic mergers as described in Section 2, and Funnelsort4, which uses the four-way basic mergers found in Section 4.3 to give slightly better results. The remaining implementation details follow what was declared the best choices in Section 4. Both implementations use parameters (α, d) = (16, 2), and use std::sort for input sizes below 400 (as this makes all k-mergers have height at least two in both).

5.1 Competitors Comparing algorithms with the same asymptotic running time is a delicate matter. Tuning of code can often change the constants involved significantly, which leaves open the question of how to ensure equal levels of engineering in implementations of different algorithms. Our choice in this paper is to use Quicksort as the main yardstick. Quicksort is known as a very fast general-purpose comparison based algorithm [28], and has long been the standard choice of sorting algorithm in software libraries. Over the last 30 years, many improvements have been suggested and tried, and the amount of practical experience with Quicksort is probably unique among sorting algorithms. It seems reasonable to expect implementations in current libraries to be highly tuned. To further boost confidence in the efficiency of the chosen implementation of Quicksort, we start by comparing several widely used library implementations, and choose the best performer as our main competitor. We believe such a comparison will give a good picture of the practical feasibility of cache-oblivious ideas in the area of comparison based sorting.
The implementations we consider are std::sort from the STL library included in the GCC v. 3.2 distribution, std::sort from the STL library from Dinkumware (www.dinkumware.com) included with Intel's C++ compiler v. 7.0, the implementation from [28, Chap. 7], and an implementation of our own, based on the proposal of Bentley and McIlroy [13], but tuned slightly further by making it simpler for calls on small instances and adding an even more elaborate choice of pivot element for large instances. These algorithms mainly differ in their partitioning strategies—how meticulously they choose the pivot element and whether they use two- or three-way partitioning. Two-way partitioning allows tighter code, but is less robust when repeated keys are present.
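For readers unfamiliar with the distinction drawn above, here is a generic sketch of three-way partitioning (Dutch national flag style), in which elements equal to the pivot end up in a middle block that needs no further sorting; it is our illustration, not the partitioning code of any of the libraries compared here.

#include <utility>   // std::pair, std::swap

// Partition [first, last) into  < pivot | == pivot | > pivot  and return the
// two block boundaries.  The pivot choice is kept trivial for the sketch.
template <typename Iter>
std::pair<Iter, Iter> partition3(Iter first, Iter last) {
    auto pivot = *first;
    Iter lt = first, i = first, gt = last;
    while (i != gt) {
        if      (*i < pivot) std::swap(*lt++, *i++);
        else if (pivot < *i) std::swap(*i, *--gt);
        else                 ++i;
    }
    return {lt, gt};   // [first,lt) < pivot, [lt,gt) == pivot, [gt,last) > pivot
}

// Quicksort then recurses only on [first, lt) and [gt, last); with many
// repeated keys the middle block shrinks the subproblems substantially.
// Two-way partitioning omits the middle block, giving a tighter loop but
// less robustness when repeated keys are present.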
To gain further insight, we also compare with recent implementations of cache-aware sorting algorithms aiming for efficiency in either internal memory or external memory by tunings based on knowledge of the memory hierarchy. TPIE [18] is a library for external memory computations, and includes highly optimized routines for e.g. scanning and sorting. We choose TPIE's sorting routine AMI_sort as representative of sorting algorithms efficient in external memory. The algorithm needs to know the amount of available internal memory, and following suggestions in the TPIE manual we set it to 192 MB, which is 50-75% of the physical memory on all machines where it is tested. The TPIE version used is the newest at the time of writing (release date August 29, 2002). TPIE does not support the MIPS and Itanium architectures, and requires an older version (2.96) of the GCC compiler on the remaining architectures.
Several recent proposals for cache-aware sorting algorithms in internal memory exist, including [5, 24, 32]. LaMarca and Ladner [24] give proposals for better exploiting L1 and L2 cache. Improving on their effort, Arge et al. [5] give proposals using registers better, and Kubricht et al. [32] give variants of the algorithms from [24] taking the effects of TLB (Translation Lookaside Buffer) misses and the low associativity of caches into account. In this test, we compare against the two Mergesort based proposals from [24] as implemented by [32] (we encountered problems with the remaining implementations from [32]), and the R-merge algorithm of [5]. We use the publicly available code from [32] and code from [5] sent to us by the authors.

5.2 Experiments We test the algorithms described above on inputs of sizes in the entire RAM range, as well as on inputs residing on disk. All experiments are performed on machines with no other users. The influence of background processes is minimized by running each experiment in internal memory 21 times, and reporting the median. In external memory, experiments are rather time consuming, and we run each experiment only once, believing that background processes will have less impact on these. Besides the three machines used in Section 3, we in these final experiments also include the AMD Athlon and the Intel Itanium 2 processor (due to our limited access period for the Itanium machine, we do not have results for all algorithms on this architecture). Their specifications can be seen in Table 1. The methodology is as described in Section 3.
5.3 Results The plots described in this section are shown in Appendix A. In all graphs, the y-axis shows wall time in seconds divided by n log n, and the x-axis shows log n, where n is the number of input elements. The comparison of Quicksort implementations showed that three contestants ran pretty close, with the GCC implementation as the overall fastest. It uses a compact two-way partitioning scheme, and simplicity of code here seems to pay off. It is closely followed by our own implementation (denoted Mix), based on the tuned three-way partitioning of Bentley and Mcllroy. The implementation from Sedgewicks book (denoted Sedge) is not much after, whereas the implementation from the Dinkumware STL library (denoted Dink) lags rather behind, probably due to a rather involved three-way partitioning routine. We use the GCC and the Mix implementation as the Quicksort contestants in the remaining experiments—the first we choose for pure speed, the latter for having better robustness with almost no sacrifice in efficiency. In the main experiments in RAM, we see that the Funnelsort algorithm with four-way basic mergers are consistently better than the one with binary basic mergers, except on the MIPS architecture, which has a very slow CPU. This indicates that the reduced number of element movements really do outweigh the increased merger complexity, except when CPU cycles are costly compared to memory accesses. For the smallest input sizes, the best Funnelsort looses to GCC Quicksort (by 10-40%), but on three architectures gains as n grows, ending up winning (by the approximately the same ratio) for the largest instances in RAM. The two architectures where GCC keeps its lead are the MIPS 10000 with its slow CPU, and the Pentium 4, which features the PC800 bus (decreasing the access time to RAM), and which has a large cache line size (reducing effects of cache latency when scanning data in cache). This can be interpreted as on these two architectures, CPU cycles, not cache effects, are dominating the running time for sorting, and on architectures where this is not the case, the theoretically better cache performance of Funnelsort actually shows through in practice, at least for a tuned implementation of the algorithm. The two cache-aware implementations msort-c and msort-m from [32] are not competitive on any of the architectures. The R-merge algorithm is competing well, and like Funnelsort shows its cache-efficiency by having a basically horizontal graph throughout the entire RAM range on the architectures dominated by cache effects. However, four-way Funnelsort is consistently better than R-merge, except on the MIPS 10000 machine. The latter is a RISC-type architecture and has a large
number of registers, something which the R-merge algorithm is designed to exploit. TPIEs algorithm is not competitive in RAM. For the experiments on disk, TPIE is the clear winner. It is optimized for external memory, and we suspect in particular that its use of double-buffering (something which seems hard to transfer to a cache-oblivious setting) gives it an unbeatable advantage6. However, Funnelsort comes in as a second, and outperforms GCC quite clearly. The gain over GCC seems to grow as n grows larger, which is in good correspondence with the difference in the base of logarithms in the I/O complexity of these algorithms. The algorithms tuned to cache perform notably badly on disk. Due to lack of space, we have only shown plots for uniformly distributed data of the second data type (records of integer and pointer pairs). The results for the other types and distributions discussed in Section 3 are quite similar, and can be found in [29]. 6
Conclusion
Through a careful engineering effort, we have developed a tuned implementation of Lazy Funnelsort, which we have compared empirically with efficient implementations of other comparison based sorting algorithms. The results show that our implementation is competitive in RAM as well as on disk, in particular in situations where sorting is not CPU bound. Across the many input sizes tried, Funnelsort was almost always among the two fastest algorithms, and clearly the one adapting most gracefully to changes of level in the memory hierarchy. In short, these results show that for sorting, the overhead involved in being cache-oblivious can be small enough for the nice theoretical properties to actually transfer into practical advantages. 7 Acknowledgments We thank Brian Vinter of the Department of Mathematics and Computer Science, University of Southern Denmark, for access to an Itanium 2 processor.
References [1] P. Agarwal, L. Arge, A. Banner, and B. HollandMinkley. On cache-oblivious multidimensional range searching. In Proc. 19th ACM Symposium on Computational Geometry, 2003. [2] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31 (9):1116-1127, Sept. 1988. °The TPIE sorting routine sorts one run while loading the next from disk, thus parallelizing CPU work and I/Os.
11
[3] L. Arge. External memory data structures. In Proc. 9th Annual European Symposium on Algorithms (ESA), volume 2161 of LNCS, pages 1-29. Springer, 2001. [4] L. Arge, M. A. Bender, E. D. Demaine, B. HollandMinkley, and J. I. Munro. Cache-oblivious priority queue and graph algorithm applications. In Proc. 34th Ann. ACM Symp. on Theory of Computing, pages 268276. ACM Press, 2002. [5] L. Arge, J. Chase, J. Vitter, and R. Wickremesinghe. Efficient sorting using registers and caches. ACM Journal of Experimental Algorithmics, 7(9), 2002. [6] R. Bayer and E. McCreight. Organization and maintenance of large ordered indexes. Acta Informatica, 1:173-189, 1972. [7] M. Bender, R. Cole, E. Demaine, and M. FarachColton. Scanning and traversing: Maintaining data for traversals in a memory hierarchy. In Proc. 10th Annual European Symposium on Algorithms (ESA), volume 2461 of LNCS, pages 139-151. Springer, 2002. [8] M. Bender, R. Cole, and R. Raman. Exponential structures for cache-oblivious algorithms. In Proc. 29th International Colloquium on Automata, Languages, and Programming (ICALP), volume 2380 of LNCS, pages 195-207. Springer, 2002. [9] M. Bender, E. Demaine, and M. Farach-Colton. Efficient tree layout in a multilevel memory hierarchy. In Proc. 10th Annual European Symposium on Algorithms (ESA), volume 2461 of LNCS, pages 165-173. Springer, 2002. [10] M. A. Bender, G. S. Brodal, R. Fagerberg, D. Ge, S. He, H. Hu, J. lacono, and A. Lopez-Ortiz. The cost of cache-oblivious searching. In Proc. 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 271-282, 2003. [11] M. A. Bender, E. Demaine, and M. Farach-Colton. Cache-oblivious B-trees. In Proc. 41st Ann. Symp. on Foundations of Computer Science, pages 399-409. IEEE Computer Society Press, 2000. [12] M. A. Bender, Z. Duan, J. lacono, and J. Wu. A locality-preserving cache-oblivious dynamic dictionary. In Proc. 13th Ann. ACM-SIAM Symp. on Discrete Algorithms, pages 29-39, 2002. [13] J. L. Bentley and M. D. Mcllroy. Engineering a sort function. Software-Practice and Experience, 23(1):1249-1265, 1993. [14] G. S. Brodal and R. Fagerberg. Cache oblivious distribution sweeping. In Proc. 29th International Colloquium on Automata, Languages, and Programming (ICALP), volume 2380 of LNCS, pages 426-438. Springer, 2002. [15] G. S. Brodal and R. Fagerberg. Funnel heap - a cache oblivious priority queue. In Proc. 13th Annual International Symposium on Algorithms and Computation, volume 2518 of LNCS, pages 219-228. Springer, 2002. [16] G. S. Brodal and R. Fagerberg. On the limits of cacheobliviousness. In Proc. 35th Annual ACM Symposium on Theory of Computing (STOC), pages 307-315, 2003.
12
[17] G. S. Brodal, R. Fagerberg, and R. Jacob. Cache oblivious search trees via binary trees of small height. In Proc. 13th Ann. ACM-SIAM Symp. on Discrete Algorithms, pages 39-48, 2002. [18] Department of Computer Science, Duke University. TPIE: a transparent parallel I/O environment. WWW page, http://ww.cs.duke.edu/TPIE/, 2002. [19] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In 40th Annual Symposium on Foundations of Computer Science, pages 285-297. IEEE Computer Society Press, 1999. [20] J. Gray. Sort benchmark home page. WWW page, http://research.microsoft.com/bare/ SortBenchmark/, 2003. [21] F. K. Hwang and S. Lin. A simple algorithm for merging two disjoint linearly ordered sets. SI AM Journal on Computing, l(l):31-39, 1972. [22] D. E. Knuth. The Art of Computer Programming, Vol 3, Sorting and Searching. Addison-Wesley, Reading, USA, 2 edition, 1998. [23] R. E. Ladner, R. Fortna, and B.-H. Nguyen. A comparison of cache aware and cache oblivious static search trees using program instrumentation. In Experimental Algorithmics, volume 2547 of LNCS, pages 78-92. Springer, 2002. [24] A. LaMarca and R. E. Ladner. The influence of caches on the performance of sorting. Journal of Algorithms, 31:66-104, 1999. [25] H. Prokop. Cache-oblivious algorithms. Master's thesis, Massachusetts Institute of Technology, June 1999. [26] N. Rahman, R. Cole, and R. Raman. Optimised predecessor data structures for internal memory. In Proc. 5th Int. Workshop on Algorithm Engineering (WAE), volume 2141, pages 67-78. Springer, 2001. [27] P. Sanders. Fast priority queues for cached memory. ACM Journal of Experimental Algorithmics, 5(7), 2000. [28] R. Sedgewick. Algorithms in C++: Parts 1-4: Fundamentals, Data Structures, Sorting, Searching. Addison-Wesley, Reading, MA, USA, third edition, 1998. Code available at http://www.cs.princeton. edu/"rs/Algs3.cxxl-4/code.txt. [29] K. Vinther. Engineering cache-oblivious sorting algorithms. Master's thesis, Department of Computer Science, University of Aarhus, Denmark, May 2003. Available online at http://www.daimi.au.dk/ "kv/thesis/. [30] J. S. Vitter. External memory algorithms and data structures: Dealing with massive data. A CM Computing Surveys, 33(2):209-271, June 2001. [31] J. W. J. Williams. Algorithm 232: Heapsort. Commun. ACM, 7:347-348, 1964. [32] L. Xiao, X. Zhang, and S. A. Kubricht. Improving memory performance of sorting algorithms. ACM Journal of Experimental Algorithmics, 5(3), 2000.
A
Plots Comparison of Quicksort Implementations
13
14
15
Results for Inputs on disk
16
17
The Robustness of the Sum-of-Squares Algorithm for Bin Packing Michael A. Bender*
Bryan Bradley*
Geetha Jagannathan*
Krishnan Pillaipakkamnatt§ Abstract Csirik et al. [CJK+99, CJK+00] introduced the sum-ofsquares algorithm (SS) for online bin packing of integralsized items into integral-sized bins. They showed that for discrete distributions, the expected waste of SS is sublinear if the expected waste of the offline optimal is sublinear. This algorithm SS has a time complexity of O(nB) to pack n items into bins of size B. In [CJK"I"02] the authors present variants of this algorithm that enjoy the same asymptotic expected waste as SS (with larger multiplicative constants), but with time complexities of O(nlogB) and O(n). In this paper we present three sets of results that demonstrate the robustness of the sum-of-squares approach. First, we show the results of experiments from two new variants of the SS algorithm. The first variant, which runs in time O(n\/BlogB), appears to have almost identical expected waste as the sum-of-squares algorithm on all the distributions mentioned in [CJK+99, CJK+00, CJK+02]. The other variant, which runs in O(nlogB) time performs well on most, but not on all of those distributions. Both these algorithms have simple implementations. We present results from experiments that extend the sum-of-squares algorithm to the bin-packing problem with two bin sizes (the variable-sized bin-packing problem). From our experiments comparing SS and Best Fit over uniform distributions, we observed that there are scenarios where when one bin size is 2/3 the size of the other, SS has 6(-y/n) waste while Best Fit has linear waste. We also present situations where SS has 6(1) waste while Best Fit has 6(n) waste. We observe an anomalous behavior in Best Fit that does not seem to affect SS. Finally, we apply SS to the related problem of online memory allocation. Our experimental comparisons between SS and Best Fit indicate that neither algorithm is consistently better than the other. If the amount of randomness is low, SS appears to have lower waste than Best Fit, while larger amounts of randomness appear to favor Best Fit. An interesting phenomenon shows that for a given range of allocation sizes we can find ranges of allocation duration where SS has lower waste than Best Fit. In the online memoryallocation problem for the uniform and interval distributions, SS does not seem to have an asymptotic advantage over Best Fit, in contrast with the bin-packing problem. * Department of Computer Science, State University of New York, Stony Brook, NY 11794-4400, USA, email: benderflcs.sunysb.edu. Supported in part by Sandia National Laboratories and NSF grants EIA-0112849 and CCR-0208670. t Department of Computer Science, Hofstra University, Hempstead, NY 11549, USA, email: pot8osCacm.org. * Department of Computer Science, State University of New York, Stony Brook, NY 11794-4400, USA, email: geethajaganflacm.org. 5Department of Computer Science, Hofstra University, Hempstead, NY 11549, USA, email: csckzpfihofstra.edu.
18
1
Introduction
In classical bin packing the input is a list L = (01,02, --,a n ) of n items and an infinite supply of bins of unit capacity. Each item a* has size s(aj), where 0 < s(a,i) < 1. The objective is to pack the items into a minimum number of bins subject to the constraint that the sum of sizes of the items in each bin is no greater than 1. Bin packing has a wide range of applications including stock cutting, truck packing, commercials assignment to stations breaks in television programming, and memory allocation. Because this problem is NPhard [GJ79], most bin-packing research concentrates on finding polynomial-time approximation algorithms for bin packing. Bin packing is among the earliest problems for which the performance of approximation algorithms were analyzed [CGJ96]. In this paper we focus on the average-case performance of online algorithms for bin packing, where item sizes are drawn according to some discrete distribution. An algorithm is said to be online if the algorithm packs each item as soon as it "arrives" without any knowledge of the items not yet encountered. That is, the decision to pack item a* into some particular bin can be based only on the knowledge of items oi,...,Oi_i. Moreover, once an item has been assigned to a bin, it cannot be reassigned. Discrete distributions are those in which each item size is an element of some set {si,si,...,sj} of integers, and such that each size has an associated rational probability. The capacity of a bin is a fixed integer B > sj. We overload notation and write s(L) for the sum of the sizes of the items in the list L. For an algorithm A, we use A{L) to denote the number of bins used by A to pack the items in L. We write OPT to denote an optimal packing algorithm. Let F be a probability distribution over item sizes. Then Ln(F) denotes a list of n items drawn according to distribution F. A packing is a specific assignment of items to bins. The size of a packing P, written ||P||, is the number of bins used by P. For an algorithm A, we use P£(F) to denote a packing resulting from the application of algorithm A to the list Ln(F). Given a packing P of a list L, the waste for P, the sum of the unused bin capacities, is defined
as W(P) = B \\P\\ - s(L). The expected waste of an algorithm A on a distribution F is EW*(F) = E[W(P*(F))], where the expectation is taken over all lists of length n. 1.1
Related Work.
Variable-sized Bin Packing. In the variable-sized bin-packing problem all bins do not have the same capacity. Assume that there are infinitely many bins of sizes Let L denote a list of items whose sizes are in (0, BI]. Lately this problem has received more attention because of its application in stock cutting and in the assignments of commercials to variable-sized breaks in television programming. Much of the literature for variable-sized bin-packing algorithms measures effectiveness in terms of the performance ratio of the algorithm (instead of expected waste). The asymptotic expected performance ratio for algorithm A on distribution F is the average ratio of the number of bins taken by A to the number of bins taken by an optimal algorithm. That is,
Bin Packing. Recently progress has been made in the average-case analysis of the standard heuristics for the discrete uniform distributions U{j, k} where the bin capacity is taken as B = fc, and item sizes are uniformly drawn from 1, 2, ...,.;' < k. When j = k — 1, the online Best-Fit and First-Fit algorithms have BCV**) expected waste. Remarkably, when 1 < j < k — 2, the expected waste of the optimal is O(l) [ECG+91j. An algorithm is said to be stable under a distribution if the expected waste remains bounded even when the number of items goes to infinity. Coffman et al. [ECG+91] proved that Best Fit is stable when k > j(j + 3)/2. Kenyon et al. [KRS98] showed that Best Fit is stable under U{k — 2, k} and is also stable for some specific values Frieson and Langston [FL86] first investigated the of (j, k) with k < 14. It has been shown experimentally variable-sized bin-packing problem. Kinnerly and that for most pairs (j, k) the expected waste of Best Langston [KL88] gave an online algorithm with perFit is 0(n). formance ratio 7/4. The variable-harmonic algorithm [Csi89], which is based on the harmonic-fit algorithm, The Sum-of-Squares algorithm. We assume that has a performance ratio of at most 1.69103. all items have integral sizes. The gap of a bin is the Epstein, Seiden, and Stee [ESS01] presented two amount of unassigned capacity in the bin. Let N(g) unbounded-space online algorithms for variable-sized denote the number of bins in the current packing with bin packing. They focused on the cases in which there gap 1 < g < B. Initially, N(g) = 0 for all g. The are two bin sizes. These algorithms are a combination sum-of-squares algorithm puts an item a of size s(a) of variable-harmonic and refined-harmonic algorithms into a bin such that after placing the item a the value [LL81]. Epstein et al. also proved a lower bound for any online algorithm for two bin sizes. The asymptotic of is minimized. performance ratio is greater than 1.33561. Csirik et al. [CJK+99], gave experimental evidence that for discrete uniform distributions U{j, k} (with Memory Allocation. Memory is modeled as an ink = 100 and 1 < j < 98) EW$S is 0(1). They also finitely long array of storage locations. An allocator reshowed for j = 99, EW£S = O(Jn). Their results ceives requests for blocks of memory of various sizes and indicated that for j = 97,98,99, the expected waste requests for their deallocation arrive in some unknown of SS, EW§S goes from 0(1) to Q(Jn) whereas the order. Although the size of the request is known to the expected waste of BF, EW^F transitions from 0(n) to allocator, the deallocation time is unknown at the time 0(1) to e(v'n). of allocation. The deallocation of a block may leave In a theoretical analysis of the sum-of-squares a "hole" in the memory. The objective of a memoryalgorithm [CJK+00] Csirik et al. proved that for any allocation algorithm is to minimize the total amount of perfectly-packable distribution F, EW§S — O(^/n). space wasted in these holes. They also proved that if F is a bounded waste disAlthough memory allocation has been studied since tribution then EWgS is either 0(1) or 0(logra). In the early days of computing, only a handful of results particular, if F is a discrete uniform distribution concern the competitive ratios of the standard heuristics U{j, k} where j < k - 1 then EW£S(F) = O(l). They for the problem. For the memory-allocation problem, also proved that for all lists L, SS(L) < 3OPT(L). 
the competitive ratio of an online algorithm is the ratio between the total amount of memory required by the algorithm to satisfy all requests, to W, the largest
19
amount of concurrently allocated memory. Luby, Naor and Orda [LNO96] showed that First Fit (the heuristic which assigns a request to the lowest indexed hole that can accommodate the request) has a competitive ratio of 0{min(log W, log C)}, where C is the largest number of concurrent allocations. By Robson's result [Rob74], this bound is the best possible value for any online algorithm. 1.2
Results.
In this paper we study the performance of the sum-ofsquares algorithm and its variants. We present faster variants of the sum-of-squares algorithm for the online bin-packing problem. Next, we compare the sumof-squares algorithm with the Best-Fit algorithm for variable-sized bin packing. Finally, we show the results of applying the sum-of-squares algorithm to memory allocation. We performed our experiments with the uniform distribution U{j, k}. We have also run our algorithms on interval distributions, especially those that are claimed to be difficult in [CJK+02]. For the memory allocation problems we used various interval distributions for both block sizes and block durations. • In Section 2 we present our variants for the sumof-squares algorithm. The first is the SSmax variant, which runs in O(nVB\ogB) time. Our experiments suggest that the performance of this variant is close to the SS algorithm in absolute terms, for all the distributions mentioned in [CJK+99, CJK+00, CJK+02]. The remaining algorithms form a family called the segregated sum-of-squares algorithms (SSS). These algorithms perform well in most of the distributions mentioned in the above papers. But on some distributions they do not have the same expected waste as SS. The best runtime in this family is 0(nloglogJ3). • Section 3 provides experimental results for the sum-of-squares algorithm applied to a generalized version of the bin-packing problem where bins come in two sizes. For fixed bin sizes, the performance of the sum-of-squares and Best-Fit algorithms can be categorized into three ranges based on item sizes. In the first range both algorithms appear to have constant waste. In the second range, Best Fit has 0(n) waste and SS has 0(1) waste. In the third range, when one bin size is around 2/3 the size of the other, SS has Q(\/n) waste while Best Fit has linear waste. When the bin sizes are not in this ratio, both algorithms have Q(-^/n} waste. We observe an anomalous behavior in Best Fit that does not seem to affect SS.
20
• In Section 4 we applied SS to the related problem of online memory allocation. Our experimental comparisons between SS and Best Fit indicate that neither algorithm is consistently better than the other. Smaller allocation durations appear to favor Best Fit, while larger allocation durations favor SS. Also, if the amount of randomness is low, SS appears to have lower waste than Best Fit, while larger amounts of randomness appear to favor Best Fit. An interesting phenomenon shows that for a given range of allocation sizes we can find ranges of allocation duration where SS has lower waste than Best Fit. In the online memory-allocation problem SS does not seem to have an asymptotic advantage over Best Fit, in contrast to the bin-packing problem. 2
Faster Variants Algorithm
of
the
Sum-of-Squares
In this section we present variants of the sum-of-squares algorithm. The SSmax variant of Section 2.2.1 runs in 0(n\/JBlogjB) time and appears to have an expected waste remarkably close to that of SS. Experimental results indicate that the segregated sum-of-squares family of algorithms (Section 2.2.2) run faster, but, in some cases, have 0(n) expected waste when SS has 0(x/n) waste. 2.1
Sum-of-Squares Algorithm.
The sum of the sizes of items in a bin is the level of the bin. The gap of a bin is the amount of its unused capacity. If the level of a bin is i, then its gap is B — t. Let P be a packing of a list L of items. Let the gap count of g, N(g}, denote the number of bins in the packing that have a gap of 1 < g < B. We call N the profile vector for the packing. We ignore perfectly packed bins (whose gap is 0) and completely empty bins. The sumB-l
of-squares for a packing P is ss(P) = ]T N(g)2. 9=1
The sum-of-squares algorithm is an online algorithm that works as follows. Let a be the next item to pack. It is packed into a legal bin (whose gap is at least s(a)) such that for the resulting packing P' the value of ss(P') is minimum over all possible packings of a. When an item of size s arrives, there are three possible ways of packing the item 1. Open a new bin: Here the sum of squares increases by l + 2N(B-s). 2. Perfectly fill an old bin: Here the sum of squares decreases by 2N(s) — 1.
3. The item goes into a bin of gap g where g > s: Instead of examining all possible legal gaps for the best Here the sum of squares increases by 2 - (N(g — s) — possible gap (as SS does), we examine only the k largest N(g) + 1). This step requires finding a g which gap counts, for some value of k. maximizes the value of N(g) — N(g — s). 2.2.1 Parameterized SSmax Algorithm. The Each tune an item arrives, the algorithm performs SSmax(fc) parameterized algorithm is based on the an exhaustive search to evaluate the change in ss(P) SooS variant mentioned above. The right choice for and finds an appropriate bin in O(B) time. In [CJK+02] k is discussed later in this section. To pack a given the authors discuss variants of the original SS algorithm. item into a bin, SSmax(A:) computes the value of ss(P) They present O(nlog5) and O(n) variants that apfor the k largest gap counts (for some k > 1). The proximate the calculation of the sum-of-squares. The algorithm also computes ss(P) for opening a new bin authors prove that these variants have the same asympand for perfectly packing the new item into an already totic growth rate for expected waste as SS, but with open bin. The given item is packed into a bin such larger multiplicative constants. For example, they conthat the resulting packing minimizes ss(P). (Note that sidered the distribution C/{400,1000}. For n = 100,000 when we set k — B, we get the original SS algorithm.) items, they state that the number of bins used by the Using a priority queue, we can find the k largest values variant is 400% more than optimal, whereas Best Fit in O(fclogB). This can be done in O(fcloglogB) time uses only 0.3% more bins, and SS uses 0.25% more bins using faster priority queues such as Van Emde Boas than necessary. The situation does improve in favor of trees. their variant for larger values of n. For n — 107, the variant uses roughly 9.8% more bins than optimal, but Experimental Results. For small values of k (such SS uses .0.0025% more bins than necessary. The auas for constant k, or for k = log-B), the experiments thors claim their fast variants of SS are "primarily of suggest that the results for SSmax(fc) are qualitatively theoretical interest", and that they are unlikely to be similar to those of SooS. However, for k near 1\fB the competitive with Best Fit, except for large values of n. average waste for SSmax(fc) undergoes a substantial They also looked at other generalizations for the change. For each of the distributions mentioned in sum-of-squares algorithm. Instead of considering the [CJK+99, CJK+00, CJK+02], we observed that the squares of the elements of the profile vector, they experformance of SSmax(2v^B) matched that of SS to ih amined the r power, for r > 1. These generalizations within a factor of 1.4. In many cases the factor is as yield various SrS algorithms. For all finite r > 1, SrS low as 1.02. (See Table 1 and Figure 1 for results from performed qualitatively the same as SS. For the limit some of the distributions.) Note that in this and other case of r = 1 (SIS), the resulting algorithm is one that tables when it becomes clear that the growth rate of an satisfies the any-fit constraint. That is, the algorithm algorithm is linear in n, we did not run experiments does not open a new bin unless none of the already open for n = 108 and n = 109. In all other cases, we bins can accommodate the new item to be packed. 
Best used 100 samples for n € {105,106}, 25 samples for Fit and First Fit are examples of algorithms that satisfy n € {107,108}, and 3 samples for n = 109. the any-fit constraint. The SooS algorithm chooses to We also compared the waste of this variant with minimize the largest value in the profile vector. This SS for the distributions U{j, 100}, t/{j,200} and variant has the optimal waste for the uniform distribuU{j, 1000}. As our results in Table 2, and Figure 2 tions U{j, k}. However, the authors provide examples show, the performance of the variant SSmax(2\/B) is of distributions for which SooS has linear waste while remarkably similar to that of SS. the optimal and SS have sublinear waste. 2.2 Our Variants. The variants try to approximate the behavior of the sum-of-squares algorithm. By attempting to minimize the value of ss(P'), SS effectively tries to ensure that it has an approximately equal number of bins of each gap , where 1 < g < B. Intuitively, the value of ss(P') can usually be reduced by lowering the largest gap count. However, minimizing the largest value in the profile vector alone is insufficient, as the SooS experiments indicate [CJK+02]. Nevertheless, this insight plays a key role in our variants for the sum-of-squares algorithm.
2.2.2 Segregated Sum of Squares Variants. All Segregated Sum-of-Squares algorithms (SSS) divide the profile vector N into some number t of approximately equal-sized contiguous sections of the vector, numbered 0 ...(< — 1). The choice of t, and other associated data structures, differentiate the various members of the SSS family. An auxiliary vector L[0 ...t — 1] keeps track of the gap g with largest gap count in each of the t sections. The change in ss(P) is evaluated for some of the legal g values in L, plus O(i) additional values. Thus, the best gap size can be selected in Q(t) steps.
21
Item sizes: 11-13,15-18, Bin size: 51 SS SSmax(2v^B) SSmax(l) SSS(sqrt) Item sizes: 18-27, Bin size: 100 SS SSmax(2N/B) SSmax(l) SSS(sqrt) Item sizes: 1,11-13,15-18, Bin size: 51 SS SSmax(2%/B) SSmax(l) SSS(sqrt)
105 9538 9559 316247 19259 105 8639 12333 939548 146468 105 585 525 26422 1130
106 31272 31314 3161114 144570 106 25454 36082 9409415 1447068 10" 595 537 2639009 1236
107 97650 97665 3.159X107 1287001 107 76259 108789 9.399X107 1.444xl07 107 635 594 2.64 xlO7 1278
108 300121 300172
109 516184 516388
12281857 108 229869 327609
1.237xl08 109 622198 943798
1.443xl08 108 414 465
109 628 577
1026
1648
Table 1: Comparison of waste for SS, SSmax(2-
Figure 1: Comparison of waste for SS, SSmax(2v/5), SSmax(l), and SSS(sqrt). (a) Item sizes uniformly drawn in the range 11-13,15-18 with bin size = 51 (block 1 of Table 1). (b) Item sizes uniformly drawn in the range 18-27 with bin size = 100 (block 2 of Table 1). In both figures, the curves represent plots for the number of items packed vs waste/sqrt(number of items). In both (a) and in (b), SS and SSmax(2\/B) are horizontal lines and are indistinguishable, and hence they are 0(\/n) waste. Other algorithms in these graphs have linear waste.
22
j=12 105 10b 107 108
ss
40 47 52 45
U{j,100> SSmax(2\/B) 42 57 52 45
j=75
SS 900 916 680 923
SSmax(2\/B) 974 982 776 1014
j=99
U{j,200>
SS 32302 108251 308919 876747
SSmax(2\/£) 32287 108176 308223 876354
j=24 105 106 107 10H
SS 117 91 L 63 78
SSmax(2\/B) 117 91 63 78
j=150
SS 3094 2691 3920 3639
SSmax(2
j=198
SS 82875 193144 541186 1908245
SSmax(2v/B) 83680 193332 538786 1908852
j=120 105 10b 10' 108
SS 253 876 580 672
SSmax(2-i/B) 253 876 580 672
j=400
SS 4432 5139 5020 4882
SSmax(2VrB) 8062 8328 8031 8081
j=990
SS 977566 2552676 5201380
SSmax(2-/B) 959566 2498732 5086175
U{j,1000>
Table 2: Comparison of waste for SS and SSmax(2\/B) for various uniform distributions U{j, k}. Here the item sizes are drawn at random between \...j and k denotes the bin size. The number of items n 6 {105,106,107,108}. The waste of SS and SSmax(2v/B) are of the same order of magnitude. For .;" = 99, both SS and SSmax(2\/B) have 0(v/ra) waste. In all the other cases both algorithms have constant waste.
Algorithms. We obtain various members of the SSS family as follows: • SSS(sqrt). We set the number of intervals t = |YB"|. Each section is of size \(B - l)/VB]. The algorithm runs as follows: Consider a new item of size s to be packed. We first examine gap values from section TO = fs/\/B"| — 1. This is the section that contains gaps with the tightest fit for the new item. We evaluate the change in ss(P) for all legal gap values of g in section m (there are Q(\fB] elements in each section). For each section m' > m, we evaluate the change in ss(P) at the largest gap count, L(m'), for section TO'. There are Q(VB) of these sections. The gap that minimizes the value of ss(P) is chosen for packing the new item. The process of finding the best gap takes 0(\/B) time. Once a gap has been identified for the new item, we update L values for at most two sections of the profile vector (the section that contains the chosen gap g, and the section that contains the newly made gap g — s). This update can be done by an exhaustive scan of each section in 0(\/2?) time. Thus, the overall time complexity to pack one item is Q(VB}. A simplification of this algorithm is to examine the change in ss(P) at L(m) alone, instead of computing the change in ss(P) for all the gaps in section TO. This simplified algorithm appears to
have linear waste when SS has a sublinear waste. Hence both components of the algorithm seem to be necessary in order to obtain a variant that gives average waste similar to that of SS. SSS(log). In this case, we set the number of sections t = [log B]. For each section we maintain a max heap on the gap counts of that section. Let s be the size of the new item to be packed and TO = fslogB/B] — 1. We evaluate the change in ss(P) for the following gaps g: 1. The gap with the largest count in section TO, L(m), and the log B gaps that follows L(m). That is, for gaps L(m), L (TO) + 1, ..., L(m) 4-logB. 2. For each section TO' > TO, the gap with the largest count in TO', L(m'}. Thus, we compute the change in the value of ss(P) for only O(logB) values. Note that since there is a heap for each section TO, the gap with the largest count in a section, L(m), can be found simply by examining the root of the heap. Once the gap g with the least increase in ss(P) has been identified, at most two heaps need to be updated. The heap that includes the gap g — s requires an increase-key operation, and the heap that includes g requires a decrease-key on the heap. Each operation can be performed in 0(logjB) time in the worst case.
23
(a)
Figure 2: Comparison of waste for SS and SSmax(2V'B). In both figures, the curves represent plots for the number of items packed vs waste, (a) Item sizes are uniformly drawn in the range 1-150 with bin size = 200 (block 2 of Table 2). Both SS and SSmax(2-«/B) appear to have constant waste, (b) Item sizes are uniformly drawn in the range 1-990 with bin size = 1000 (block 3 of Table 2). The waste of SS and SSmax(2
• SSS(loglog): We set t = [log log B]. For each section we maintain a maxheap of at most logB non-zero gap counts of the section. Since not all values of a section are in the heap, the top of the heap may not contain the largest gap count for the section. When a gap count in a section changes we add it to the heap and drop the last value in the heap (if the number of items in the heap exceeds logB). Let s be the size of the new item to pack and m - f slo ffi g "| - 1. If H(m) represents the largest value in the heap for section m, we evaluate the change in ss(P) at the following gap values:
The experiments show all versions of the SSS algorithms have the constant-waste property for j < 98, and appear to have qualitatively the same waste as SS for j = 99. Except for the SSS(loglog), the rest of the family appear to be within 0.80% of SS in terms of waste. We also observe that the waste for SSS(log log) seems to match SS to within a constant factor (a factor no more than 7). The SSS variants, however, do not fare as well on some of the interval distributions. Table 1 suggests for some of the distributions, SS appears to have sublinear waste, while the SSS(v/r6) algorithm appears to exhibit linear waste. Our experiments thus indicate that the sum-of1. The gap with the largest count in the heap for section m, and the log log B gap values that squares minimization is sufficiently robust that it is not follow (that is, for gaps -fiT(m), H(m) +1, ..., necessary to compute it precisely to receive its benefits. The variants we have presented are uncomplicated #(m)+loglog.B). algorithms that are easily implemented. 2. For each section m' > m, we evaluate the change at gap H(m'}. 3 Variable-sized Bin Packing
Experimental Results. We ran our experiments on the U{j,k} distributions. For B - k = 100, for each value of j from 1 to 99, and for each of n € {105,106,107,108,109}, we computed the average waste of SS, BF, SSS(sqrt), SSS(log), and SSS(loglog). We averaged the waste over 100 runs for n 6 {105,106}, over 25 runs for n 6 (107,108}, and over 3 runs for n = 109. We know from [CCG+00] that EW£PT = 0(1) for j < k - 2 = 98, and EW$PT = 0(v/n) when j = k — 1 = 99. We have summarized the results in Table 3, showing the waste for j at 12, 25, 75, and 99.
24
We now present results from experiments that naturally extend the sum-of-squares algorithm to accommodate variable-sized bins. We compare these results with the results from a straightforward extension of the Best-Fit algorithm. Algorithm. Let us assume that there are infinitely many bins of integral sizes B = B\ > B^ > . - . > Bfc, where k > 2, and that items have integral sizes. Let N(g) denote the number of bins in the current packing with gap 1 < g < B. When an item arrives, it goes
j=12 SS SSS(sqrt) SSS(log) SSS(log log) BF j=25 SS SSS(sqrt) SSS(log) SSS(log log) BF j=75 SS SSS(sqrt) SSS(log) SSS(log log) BF j=99 SS SSS(sqrt) SSS(log) SSS(log log) BF
105 46 46 46 302 46 105 60 61 61 341 177 105 920 934 1003 1204 43448 105 33996 33772 33651 33228 26330
106 48 49 48 404 49 106 62 63 63 335 834 106 902 906 953 1174 425606 106 107707 107207 106743 105379 88169
107 53 53 53 357 53 107 58 58 58 350 7322 107 917 927 973 1213 4243029 107 354053 354003 352245 343777 302489
108 50 50 50 350 50 108 53 53 53 313 71893 108 873 902 933 1133 42460163 108 1089447 1079132 1076027 1058548 950047
109 18 18 18 402 18 109 68 68 68 342 728368 109 1041 1025 1008 1324 424241908 109 2848891 2484005 2832558 2954212 27843191
Table 3: Comparison of SSS algorithms with SS and BF for various uniform distributions, where items sizes are drawn in the ranges 1...J for j = 12,25,75,99 and bin size = 100, and the number of items n e {105,106,107,108,109}. All variants of SS have waste of the same order of magnitude as SS. BF has linear waste when j € {25,75}, while SS has sublinear waste for those values of j. B-l
into a legal bin such that the value of ]T) N(g)2 is 9=1 minimized. When a new bin has to be opened, the algorithm chooses a bin of size B* such that the value B-l
]C ^(0)2 *s minimized. 9=1 Experimental Results. All of our experiments are conducted under the uniform discrete distributions. We consider the case of two bin sizes. With one bin size fixed at B\ — 100 we varied the other bin size #2, 2 < B-2 < 100. We generated item sizes uniformly at random between 1 and j, for each value of j between 2 and 99. We compared the modified SS algorithm against a straightforward extension of the Best-Fit algorithm for multiple bin sizes. Our observations show that for fixed bin sizes the performance of these algorithms can be categorized into three ranges based on item sizes. In the first range, both algorithms appear to have constant waste. In the second range, Best Fit has 0(n) waste and SS has O(l) waste. In the third range, when one bin size is around 2/3 the size of the other SS has Q(\/n) waste while Best Fit has 0(n) waste. When the bin sizes are not in the ratio mentioned above, both algorithms have ®(\/n) waste
(see Table 4 and Figure 3). The behavior of SS in these ranges is independent of the bin sizes. In the case of Best Fit the above mentioned behavior is seen as long as the largest item size is greater than the smaller bin size. When the size of the largest item generated, j, is no greater than the smaller bin, B% (that is, if j < B^), then Best Fit never opens a bin of the larger size B\. In those cases, the behavior of Best Fit is as in classical bin packing with only one bin size. We also observed an interesting phenomenon in the performance of Best Fit for the two bin case. For values of the largest item size, j, greater than 75, the ratio of the average waste of Best Fit to average waste of SS remains constant (around 150) as the size of the second bin goes from 2 to near 60. When J32 is about 2/3 the size of the first bin, the ratio undergoes a 15-fold increase (to around 2500), and then subsequently drops (Figure 4). That is, there are counter-intuitive situations where increasing the size of the second bin actually causes Best Fit's waste to dramatically increase. (Although a similar phenomenon can also be observed for smaller values of j, the change in waste is not as dramatic as for j > 75.)
25
Bl=100 B2=20
B2=40
B2=60
B2=90
n 105 10ti 10V 108 105 10« 107 108 105 10" 107 108 105 10& 107 108
1-10
ss
50 53 53 26 51 64 49 32 55 50 48 57 48 45 41 50
BF 21 21 29 26 20 19 21 12 31 28 30 30 44 43 50 13
1-40 SS 105 104 96 114 121 119 125 132 112 131 115 112 124 120 97 106
BF 5714 54750 543740 5433252 9151 27823 86083 239912 12918 123699 1235771 12356346 5152 49566 497937 4960739
1-70 SS 632 601 635 627 612 592 606 549 616 640 695 467 514 572 493 660
BF 32687 317288 3147865 31461781 41208 404233 4021002 40243002 68911 654135 6462667 64342607 40580 397687 3979766 39737014
1-99 SS 34732 110091 319029 1037493 32991 106553 340963 1005371 32738 109844 354860 1092447 26442 90732 293882 895774
BF 26300 88247 285693 952666 26563 87191 289709 835138 228309 2181956 21413094 212776914 65691 522376 4794953 46178777
Table 4: Comparison of the waste for SS and BF for various uniform distributions with one bin size B\ = 100 and the other bin size B2 = 20,40,60,90. The item sizes are from 1-10, 1-40, 1-70, 1-99. The number of items to be packed n € {105,106,107,108}.
Figure 3: Comparison of SS and BF for item sizes 1...99, Bl = 100, £2 6 {20,40,60,90}. The y-axis has a log scale. This a plot of the number of items packed vs. the waste/x/n. All horizontal lines represent Q(<\/n) waste. The SS algorithm has Q(\/n) waste for all values of B-z listed above. The Best-Fit algorithm has linear waste for B2 e {60,90} and &(Vn) waste for B2 e {20,40}.
26
Figure 4: Best Fit's anomalous behavior for two bin sizes (item sizes 1...85, BI = 100). The number of items packed = 106. This graph indicates that the SS is able to adapt as the second bin size changes, but Best Fit is unable to adapt as the second bin size changes.
4 Memory Allocation Researchers have long observed the similarity between bin packing and memory allocation [Dyc90, CL77, CL86]. A number of heuristics that apply to bin packing translate directly to memory allocation, and vice-versa. Since the sum-of-squares algorithm is effective for bin packing [CJK+02], we also study the algorithm in the context of memory allocation.
the newly deallocated block is contiguous with a previously deallocated block (or blocks) in L, these blocks are merged to form one larger free block. Free blocks at the end of memory are discarded and the memory "shrinks" to end at the last allocated block. In all cases, the profile vector P is updated appropriately.
4.3 Experimental Results. There are no well defined "bin sizes" for memory allocation, in contrast with bin packing, although holes 4.1 Bin Packing and Memory Allocation. Dynamic bin packing [CGJ83] is a variant of the clas- in memory correspond to gaps in bins. The parameters sical bin-packing problem where items are periodically for our experiments in memory allocation are deleted. At any step, an algorithm for this problem is • the range of block sizes, and either given a new item to pack, or is required to delete • the range of block durations. a previously packed item. As always, the objective in this problem is to minimize the total number of bins For consistency with the rest of the paper, we have used for the entire list of items. restricted ourselves to uniform and interval distribuClosely related to the dynamic bin-packing problem tions for both parameters mentioned above. In our is the memory-allocation problem in paged-memory experiments the block size B represents the interval operating systems. In this online problem, memory [B, B + 6B}- The value of B ranges from 1 to 1000 is represented as an infinitely long array of storage and 63 varies from 1 to 1000. Similarly, a duration of t locations. Requests for blocks of memory of various represents the interval [t, t + 5t]- The values of t ranges sizes, and requests for their deallocation arrive overtime. from 1 to 20,000 and 5t varies from 1 to 1000. Although the size of the request is known to the After each request is assigned a block of memory, we allocator, the deallocation time is unknown at the time measure the waste of the algorithm (as the sum of the of allocation. When an allocation request arrives, it blocks on the free list). The waste of the algorithm for a must be satisfied by assigning a contiguous set of free run is the maximum waste taken over all the allocation memory locations. Once an allocation has been made, requests. the block cannot be moved until it is deallocated. The experiments indicate that neither algorithm is The deallocation of a block may leave a "hole" in consistently of lower waste than the other for these the memory. The objective of a memory allocation- distributions. Even when an algorithm has a lower algorithm is to minimize the total amount of space waste than the other, both algorithms seem to have wasted in these holes. waste of the same order of magnitude. Figure 5(a) is a typical example of the scenario where BF has 4.2 Sum-of-Squares Algorithm for Memory Al- lower waste than SS. Our experiments suggest that location. BF appears to have an advantage over SS when the The algorithm maintains a list L of free blocks of mem- duration of the requests are small. Figure 5(b) is a ory. It also maintains a profile vector P which lists the typical example of the scenario where SS has lower number of free blocks of each size. (The block sizes waste than BF. Figure 6(a) shows the difference in waste that have a zero count are not explicitly stored in the between the two algorithms as the request size increases. vector.) When a new request for allocation arrives, the The difference between the waste of BF and that of SS algorithm computes the sum of the squares based on the also increases as the longevity of the allocation requests profile vector, for each legal block size. We say a block increases (Figure 6(b)). We have also observed that for size is legal if the requested allocation fits in the block. 
every possible range of request sizes there exists a lower The algorithm selects a block that has the minimum bound on the time duration from which point onwards sum-of-squares value. When a free block is selected the the sum-of-squares algorithm has lower waste than Best requested memory is allocated from the beginning of the Fit (Figure 7). block. The selected block is deleted from the list L and Our experiments suggest that in the online memorya new block that represents the left-over portion of the allocation problem SS does not seem to have an asympblock is inserted into L. If a block does not fit into any totic advantage over Best Fit for the uniform and interof the free blocks, then we "enlarge" memory to accom- val distributions, in contrast with the bin-packing probmodate the new request. When a deallocation request lem. In bin packing SS performs well since it tries to arrives, we obtain a free block and it is added into L. If maintain an equal number of bins for each gap size. So
27
Figure 5: Comparison of the waste of Best Fit against Sum of Squares for memory allocation. These are plots of the number of allocation requests vs. waste, (a) Duration range is 500-599, and size range is 100-199. The size range and the duration range are fixed. Best Fit has lower waste, but the waste appears to be of the same order as for Sum of Squares, (b) Duration range is 3000-3099, size range is 150-249. The size range and the duration range are fixed. Best Fit has higher waste, but the waste appears to be of the same order as for Sum of Squares.
Figure 6: Comparison of waste of Best Fit and Sum of Squares for memory allocation, (a) The is a plot of request size vs. waste. Duration range is 500-599. The number of allocation requests is 5000. The duration range and the number of allocation requests are fixed. The difference in waste increases as the block size increases, (b) This is a plot of request duration vs. waste. Block size range is 101-200. The number of allocation requests is 20,000. The difference between the waste of BF and SS increases as item duration increases.
28
when an item arrives, the allocator finds a suitable gap size and is able to pack a bin perfectly. A possible explanation for the behavior of the algorithms indicated by our experiments is as follows. In our experiments both the block size and the duration are uncorrelated (they are independent random variables). So when the system reaches a steady state where the rate of allocation and deallocation are the same, most of the time the Best-Fit allocator is able to find a hole of an appropriate size for any request. Deallocation provides Best Fit the same advantage that helps SS perform well in bin packing. It is too early to predict what happens when durations follow a more realistic distribution in which durations are correlated with size, such as the heavy tail distribution. References [CCG+00] E. G. Coffman, C. Courcoubetis, M. R. Garey, D. S. Johnson, P. W. Shor R. R. Weber, and M. Yannakakis. Bin packing with discrete item sizes, part I: Perfect packing theorems and average case behavior of optimal packings. SIAM Journal on Discrete Math., 13:384-402, 2000. [CGJ83] E. G. Coffman, M. R. Garey, and D. S. Johnson. Dynamic bin packing. SIAM Journal of Computing, 12:227-258, 1983. [CGJ96] E. G. Coffman, M. R. Garey, and D. S. Johnson. Approximation Algorithm for NP-Hard Problems, chapter Approximation Algorithms for Bin Packing: A Survey, pages 46-93. 1996. [CJK+99] J. Csirik, D. S. Johnson, C. Kenyon, P. W. Shor, and R. R. Weber. A self organizing bin packing heuristic. In Proceedings of ALENEX workshop, pages 246-265, 1999. [CJK+00] J. Csirik, D. S. Johnson, C. Kenyon, J. Orlin, P W ; ; fhor' "? R' R- Weber. On the sum-of-squares
Figure 7: Comparison of SS and BF over a range of parameters of request sizes and durations. Here ^"'f^L ? P3Ckmg * *%***** ** f "* r _ _ ,_, „ . A . . Annual ACM Symp. on' the Theory of °* Computing, C ori , 5B = St = 99. For all points above the curve SS has nnitaa , . pages lower waste than BF. For all points below the curve, [CJK+02] J. Csirik, D. S. Johnson, C. Kenyon, J. Orlin, BF has lower waste than SS. R w. Shorj ^ R R. Weber. On the sum-of-squares algorithm for bin packing. Unpublished Manuscript, 2002. [CL77] E. G. Coffman, Jr. and J. Y. Leung. Combinatorial analysis of an efficient algorithm for processor and storage allocation. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science, pages 214-221, 1977. [CL86] E. G. Coffman, Jr. and F. T. Leighton. A provably efficient algorithm for dynamic storage allocation. In Proceedings of the eighteenth annual ACM symposium on Theory of computing, pages 77-90, 1986. [Csi89] J. Csirik. An on-line algorithm for variable-sized bin packing. Ada Informatica, 26:697-709, 1989.
29
[Dyc90] H. Dyckhoff. A typology of cutting and packing problemsacking. European Journal of Operational Research, 44:145-159, 1990. [ECG+91] E.G.Coffman, C. Courcoubetis, M. R. Garey, D. S. Johnson, L. A. McGeogh, P. W. Shor R. R. Weber, and M. Yannakakis. Fundamental discrepancies between average-case analyses under discrete and continuous distributions. In Proceedings of the 23rd Annual Symposium on Theory of Computing, pages 230240, 1991. [ESS01] L. Epstein, S. Seiden, and R. V. Stee. On the fractal beauty of bin packing. Technical Report SEN-R0104, Centrum voor Wiskunde en Informatica, 2001. [FL86] D. K. Friesen and M. A. Langston. On variable sized bin packing. SIAM J. Computing, 15:222-230, 1986. [GJ79] M. R. Garey and D. S. Johnson. Computers and Intractability A Guide to the Theory of NPCompleteness. 1979. [KL88] N. G. Kinnersley and M. A. Langston. Online variable sized bin packing. Discrete Appl. Math., 22:143-148, 1988. [KRS98] C. Kenyon, Y. Rabani, and A. Sinclair. Biases, random walks, Lyapunov functions and stochastic analysis of best fit bin packing. J. Algorithms, 27:218-235, 1998. [LL81] C. C. Lee and D. T. Lee. A simple online bin packing algorithm. J. Assoc. Comput. Mach., 32:562572, 1981. [LNO96] M. G. Luby, J. Naor, and A. Orda. Tight bounds for dynamic storage allocation. SIAM Journal on Discrete Math., 9:155-166, 1996. [Rob74] J. M. Robson. Bounds for some function concerning dynamic storage allocation. Journal of the ACM., 12:491-499, 1974.
30
Practical Aspects of Compressed Suffix Arrays and FM-index in Searching DNA Sequences Wing-Kai Hon* Tak-Wah Lam* Wing-Kin Sung* Wai-Leuk Tse* Chi-Kwong Wong* Siu-Ming Yiu* Abstract Searching patterns in the DNA sequence is an important step in biological research. To speed up the search process, one can index the DNA sequence. However, classical indexing data structures like suffix trees and suffix arrays are not feasible for indexing DNA sequences due to main memory requirement, as DNA sequences can be very long. In this paper, we evaluate the performance of two compressed data structures, Compressed Suffix Array (CSA) and FM-index, in the context of searching and indexing DNA sequences. Our results show that CSA is better than FM-index for searching long patterns. We also investigate other practical aspects of the data structures such as the memory requirement for building the indexes.
1
Introduction
With the availability of different DNA sequences (in particular, the human genome), many biological research activities involve searching DNA sequences for various patterns (say, genes). To speed up the search process, one would naturally try to index a DNA using classical indexing data structures like suffix trees [13] and suffix arrays [12], which require 6(n) words, or O(nlogn) bits, of storage and can locate a pattern P efficiently in 0(\P\ 4- occ) and 0(|P| logn 4- occ) time, where n and occ denote the length of the DNA sequence and the number of occurrence of P, respectively. "Department of Computer Science and Information Systems, The University of Hong Kong, Hong Kong, {ckwongS,twlam,wkhon,witse,smyiu}9csis.hku.hk * School of Computing, National University of Singapore, Singapore, ksungOcomp. mis.edu.sg
However, such efficient schemes are not feasible due to the main memory requirement. For example, the human genome contains 2.88 Giga characters (bases), and even a single chromosome contains several hundred Mega characters. The best known implementations of suffix trees and suffix arrays for DNA sequences require 17.25n bytes and 4.25n bytes in practice [9, 12]. These figures amount to several tens of Gigabytes of main memory, which far exceeds the capacity of a PC nowadays.¹
¹ The maximum size of main memory for a PC is 4 Gbytes.
In recent years, exciting results have been obtained in compressing suffix arrays. Compressed Suffix Arrays (CSA) [6] and the FM-index [5] are such examples. For indexing a DNA sequence, the bound for the basic structure of CSA is only 5n bits, while the FM-index occupies 3n bits in practice [4]. These figures immediately imply that we can store the CSA and FM-index of DNA of length up to a few Gigabases, which covers almost every known DNA sequence nowadays. In contrast, the suffix tree and suffix array can only handle DNA of length up to 180 Mbases and 900 Mbases, respectively. Despite the compactness in size, the searching performance of CSA and FM-index is asymptotically of the same order as (or even better than) suffix arrays. Given these promising theoretical bounds, it is natural to ask how fast CSA and FM-index can search in practice. Moreover, which one is better? For FM-index, previous experiments [4] have shown that its performance is comparable to suffix arrays when searching a pattern whose length is
short (8-15 bases). However, for searching long patterns (say, a few hundred or even a few thousand bases), it is not known whether the result remains consistent. Another issue with respect to searching is that, in the literature, there are two types of searching methodology for CSA or FM-index. One of them is called forward search, which is the classical approach for suffix arrays. The other is called backward search, which is the method tailored for CSA or FM-index and has been shown to be optimal in the worst case. It is interesting to find out whether, in practice, the optimal backward search always beats the forward search. Finally, if CSA and FM-index are to be workable in practice, we have to consider yet another important issue: the memory requirement for their construction. This requirement can be more severe than the storage requirement, as the original construction methods first build the suffix array and then perform compression. For instance, the suffix array for the human genome (2.88 Gbases) already occupies at least 12 Gbytes, which by itself cannot be built on an ordinary PC, not to mention performing the compression afterwards. Nevertheless, we can overcome the memory problem by constructing the CSA and FM-index in an alternative way. Previous results [7, 10] show that CSA can be directly constructed from the DNA sequence in O(n) bits of working space, while the FM-index can be converted from the CSA in negligible space. In this case, we would like to determine the maximum length of a genome whose CSA and FM-index can be constructed, and the time needed to construct such an index. This paper evaluates the performance of the CSA and FM-index when they are used to index DNA sequences, and attempts to answer all the above questions. For the searching performance, we have constructed the CSA and the FM-index for E.coli (4.6 Mbases), Fly (98 Mbases) and Human (2.88 Gbases). In each setting, we tested the searching times using patterns of length from 10 to 10,000. Patterns are extracted from random positions in
the corresponding DNA sequence so as to capture the worst-case performance. From our experiments, we find that FM-index is faster than CSA for searching short patterns, while for long patterns, CSA is better. For the comparison between forward search and backward search, we observe that using backward search, FM-index is consistently faster than CSA. However, using forward search, CSA is faster than FM-index. The most surprising result is that, for long patterns, forward search is more efficient than backward search. Roughly speaking, for patterns of length less than 2000, FM-index with backward search is most efficient; otherwise, CSA with forward search is fastest. See Figure 1 for the timing of the experiments on Fly. For the construction limits, we have implemented programs that can successfully construct the CSA and FM-index for DNA sequences of length up to 3 Gbases. The construction times needed are 24 and 28 hours for CSA and FM-index, respectively.
Remark: For the storage space, we observe that the basic structure for CSA in practice occupies 4n bits of space.
The paper is organized as follows. Section 2 gives some background on CSA and FM-index. Section 3 investigates the searching performance of CSA and FM-index, while Section 4 comments on the construction requirements. Concluding remarks are given in Section 5.
2 Background
In this section, we give a brief introduction to CSA and FM-index. The first part describes their basic structures, while the second part discusses the two searching methodologies associated with them. Finally, we highlight our implementation details in the third part.
2.1 Basic Structure  Given a text T[1..n], the suffix array SA[1..n] is a permutation of the integers {0, 1, ..., n-1} that denotes the lexicographical order of the suffixes of T. Precisely, SA[i] = j if and only if T[j..n] is the i-th smallest suffix
among all the other suffixes. The basic structure for both CSA and FM-index is a sequence of n integers obtained from a transformation of SA. Precisely, CSA stores the function Psi, where Psi[i] = SA^{-1}[SA[i] + 1]; the FM-index stores the function Phi, where Phi[i] = SA^{-1}[SA[i] - 1]. These integers can be stored in 32n bits in a brute-force manner, but with a suitable transformation, we can show that most of the integers in the transformed sequence are small, and space reduction is achieved when we encode each integer using a prefix-free code. The size of the basic structure for CSA is bounded by 5n bits, while in practice it occupies only 4n bits. For the FM-index, the worst-case bound is 10n bits, while in practice it is close to 3n bits. However, the use of prefix-free codes has a disadvantage: to decode any region of the sequence, we must start the decoding process from the beginning of the encoded sequence. In order to speed up the decoding process, which is important to the performance of the compressed indexes, the original papers suggest storing auxiliary information in addition to the basic structure. This auxiliary information consists of markers stored at regular locations within the encodings, so that decoding can be started from the closest marker instead of from the very beginning. As the performance of the compressed indexes is poor when the auxiliary information is too limited, the rest of the paper considers auxiliary information that occupies O(n) bits of space.
Figure 1: Searching performance of CSA and FM-index for Fly.
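To make the definitions above concrete, the following sketch (ours, not the authors' code) tabulates Psi and Phi directly from a plain suffix array. Treating the text as cyclic, so that SA[i] +/- 1 is always defined, is our simplifying assumption for the illustration.

```cpp
// Illustrative sketch only: build the CSA function Psi and the FM-index
// function Phi from an uncompressed, 0-based suffix array of length n.
#include <cstddef>
#include <vector>

struct SaTransforms {
    std::vector<std::size_t> psi;  // psi[i] = SA^{-1}[SA[i] + 1]
    std::vector<std::size_t> phi;  // phi[i] = SA^{-1}[SA[i] - 1]
};

SaTransforms buildTransforms(const std::vector<std::size_t>& sa) {
    const std::size_t n = sa.size();
    std::vector<std::size_t> rank(n);              // rank = SA^{-1}
    for (std::size_t i = 0; i < n; ++i) rank[sa[i]] = i;

    SaTransforms t;
    t.psi.resize(n);
    t.phi.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        t.psi[i] = rank[(sa[i] + 1) % n];          // next suffix in text order
        t.phi[i] = rank[(sa[i] + n - 1) % n];      // previous suffix in text order
    }
    return t;
}
```

In the compressed indexes these arrays are of course never stored explicitly; only their difference-coded forms are kept, as described in Section 2.3.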
2.2 Searching Methodology  Let T_i denote the suffix T[i..n]. Given the text T and the corresponding suffix array SA, we can search for the occurrences of a pattern P in the following way [12]: (1) For i = 1, 2, ..., m, find the range [s, t] such that T_SA[s], ..., T_SA[t] are exactly those suffixes having P[1..i] as a prefix; (2) If s <= t, report SA[s], ..., SA[t] as the occurrences of P in T. Otherwise, report that no occurrences are found. In the above algorithm, we recursively compute the range of the suffix array that corresponds to the occurrences of P[1..i], for i = 1, 2, ..., m. We refer to this searching methodology as forward search, as we are intuitively matching the characters of P with T in a forward manner. On the other hand, the definition of the CSA, or the FM-index, favours another searching methodology, in which the pattern P is matched with the text T starting from the last character. We refer to this as backward search. Table 1 gives a summary of the worst-case searching performance of CSA and FM-index. Note that in the table, the bounds depend on a parameter epsilon, where 0 < epsilon <= 1. This epsilon is in fact a tradeoff parameter between searching time and index space. In order to achieve the stated time bounds, the corresponding CSA or FM-index requires 1/epsilon times the size of the basic structure. In this paper, we consider epsilon to be 1.
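As a point of reference (our sketch, not taken from the paper), forward search on an uncompressed suffix array can be written with two binary searches over the whole pattern at once. The paper's formulation refines the range character by character instead, but the resulting set of matching suffix-array positions is the same.

```cpp
// Illustrative forward search on a plain suffix array: return the half-open
// range [s, t) of suffix-array positions whose suffixes have `pattern` as a
// prefix; an empty range means the pattern does not occur.
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

std::pair<std::size_t, std::size_t>
forwardSearch(const std::string& text,
              const std::vector<std::size_t>& sa,
              const std::string& pattern) {
    // Compare the suffix starting at `pos`, truncated to |pattern| characters,
    // against the pattern; a suffix shorter than the pattern compares smaller.
    auto suffixLess = [&](std::size_t pos, const std::string& p) {
        return text.compare(pos, p.size(), p) < 0;
    };
    auto patternLess = [&](const std::string& p, std::size_t pos) {
        return text.compare(pos, p.size(), p) > 0;
    };
    auto lo = std::lower_bound(sa.begin(), sa.end(), pattern, suffixLess);
    auto hi = std::upper_bound(sa.begin(), sa.end(), pattern, patternLess);
    return {static_cast<std::size_t>(lo - sa.begin()),
            static_cast<std::size_t>(hi - sa.begin())};
}
```

Each comparison costs O(|P|) time, which is why the average-case behaviour discussed in the remark after Table 1 is dominated by the O(log n) number of probes rather than by the pattern length.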
Table 1: Searching performance of CSA and FM-index

Index                 Forward Search                              Backward Search
CSA                   O(|P| log^{1+epsilon} n + occ log^epsilon n)   O(|P| log n + occ log^epsilon n)
CSA (epsilon = 1)     O(|P| log^2 n + occ log n)                     O(|P| log n + occ log n)
FM-index              O(|P| log^epsilon n + occ log^epsilon n)       O(|P| + occ log^epsilon n)
FM-index (epsilon = 1)  O(|P| log n + occ log n)                     O(|P| + occ log n)
Remark: Although the worst-case behaviour of backward search (in both indexes) is better than that of forward search, in practice forward search may outperform backward search for two reasons. Firstly, based on Manber and Myers [12], the average time of forward search is expected to be O(|P| log n + occ log n) and O(|P| + occ log n) for CSA and FM-index, respectively, which matches the worst-case time of backward search. Secondly, each operation of the backward search is usually more computationally involved, so that the hidden constant in the worst-case bound can be very high.
2.3 Implementation Details  Our implementation of CSA is based on Lam et al. [10].² In their paper, they observed that the basic structure, Psi, can be partitioned into four increasing sequences with values between 0 and n. Then, for each sequence s1, s2, ..., we can store the difference values (that is, s1, s2 - s1, s3 - s2, and so on) instead of the original values. Compression can be achieved by encoding each of these difference values separately using variable-length prefix-free codes, such as the gamma code or the delta code [3]. The original paper suggests using the delta code, which has a better worst-case size bound for general alphabets; but for our implementation, we use the gamma code instead, as it achieves a smaller size than the delta code when encoding the Psi function of a DNA sequence in practice.
As mentioned before, we need to store an auxiliary structure for the prefix-free coded sequence to speed up the decoding process. In our implementation, we store Psi[i] explicitly whenever i mod l = 0, for some l = O(n/log n). Then, to compute Psi[k], we first find Psi[cl] with cl <= k < (c + 1)l in constant time. Afterwards, we look at the encoded sequence for Psi[cl+1] - Psi[cl], Psi[cl+2] - Psi[cl+1], ..., Psi[k] - Psi[k-1] to compute Psi[k] - Psi[cl]. The latter part can be done using a constant number of standard table lookups, so that the overall time to compute Psi is O(1). Moreover, the space required is O(n) bits. Note that the more values of Psi we store explicitly, the more space we need, but the faster the decoding process.
To support computing the SA values, we need to store another auxiliary data structure. Precisely, we store SA[i] explicitly whenever i mod t = 0, for some t = O(n/log n). To compute SA[i], we compute Psi[i], Psi^2[i], Psi^3[i] and so on, until Psi^k[i] mod t = 0. Then, we obtain the value s = SA[Psi^k[i]], and it is easy to verify that SA[i] = s - k. Note that this implementation requires O(n) bits of space. On the other hand, it does not guarantee worst-case time for retrieving the SA values. Despite this, it has good average-case performance: an SA value can be retrieved in expected O(log n) time in practice. In fact, we can modify our implementation to guarantee worst-case time for getting SA, but this would require a complicated data structure that answers rank and select queries [8] in O(1) time.³ In practice, this means more space for the index and more time for getting SA. For the same reason, our implementation of FM-index adopts an auxiliary data structure that supports the retrieval of SA in average O(log n) time instead of worst-case time. For the other parts, it follows closely that in [5], which includes techniques like the Burrows-Wheeler transform [2], move-to-front encoding [1] and run-length encoding. (See the original paper for more details.)
² We have implemented and tested three other variations of CSA. Their performance for searching DNA sequences is quite similar; the variation reported in this paper is consistently the best in our studies.
³ Though the theoretical bound is O(1), the hidden constant is high in practice.
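The SA-lookup loop just described is easy to sketch (again ours, illustrative only), assuming that Psi can be evaluated in O(1) time from the compressed encoding and that SA is sampled at indices that are multiples of the sample rate t.

```cpp
// Sketch of the sampled-SA lookup described above.  Assumes every Psi-cycle
// contains at least one sampled index (index 0 is always sampled here).
#include <cstddef>
#include <functional>
#include <vector>

struct SampledSA {
    std::size_t n;                                  // text length
    std::size_t t;                                  // sample rate
    std::function<std::size_t(std::size_t)> psi;    // psi(i) = SA^{-1}[SA[i]+1]
    std::vector<std::size_t> sampled;               // sampled[j / t] = SA[j]

    // Follow psi from i until an explicitly stored entry is reached; if this
    // takes k steps and s = SA[psi^k(i)], then SA[i] = s - k (mod n).
    std::size_t lookup(std::size_t i) const {
        std::size_t j = i, k = 0;
        while (j % t != 0) {
            j = psi(j);
            ++k;
        }
        std::size_t s = sampled[j / t];
        return (s + n - k) % n;   // modular guard in case of wrap-around
    }
};
```

The choice of t is exactly the time/space tradeoff mentioned above: a denser sample gives shorter walks but a larger auxiliary structure.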
3 Searching a Pattern with CSA and FM-index
This section investigates the practical searching behaviour of CSA and FM-index. In the first part, we compare the two searching methodologies for CSA and FM-index, and show that forward search in practice performs better than backward search when the pattern length is long. In addition, other interesting findings are also observed. In the second part, we perform a case study on the practical performance of CSA and FM-index compared with the suffix tree and suffix arrays.
We ran all the experiments on a machine equipped with a 1.7 GHz Pentium IV processor with 512 Kbytes of L2 cache and 4 Gbytes of RAM. The operating system was Solaris 9.
3.1 Forward Search and Backward Search  We have constructed the CSA and the FM-index for the following genomes: E.coli (4.6 Mbases), Fly (98 Mbases), and Human (2.88 Gbases). As mentioned, the amount of auxiliary information affects the performance of the compressed index greatly. For our experiments, we have investigated three implementations of each index, which respectively require a total of 4.5n, 6n and 8n bits of space; these are referred to as the small, medium and large implementations.⁴ In cases where we conduct forward search, an additional 2n bits of memory is required for storing the DNA text.
For each genome, we have tested the searching times using patterns of lengths 10, 50, 100, 500, 1000, 5000, and 10000, where patterns are selected from the corresponding genome at random positions, so as to get a more accurate account of the worst-case behaviour.⁵ Each test case is repeated 1000 times to obtain an average timing. Finally, only searching times for existence (that is, to determine whether the pattern P occurs in the DNA sequence T) are reported, as the time for enumeration (that is, reporting the occurrences of P in T) is independent of the searching methodology.
From our experiments, we observe that if the indexes are provided with the same amount of space (in terms of n), the searching performance with genomes of different length does not vary a lot. For instance, compare Tables 2 and 3 to see the timing difference of searching with Fly and with Human. In this section, we shall focus on the experimental results for Fly (Table 2), and use them to present the general observations that we have made.
Some interesting findings can be summarized as follows.
• Using backward search, FM-index is at least ten times faster than CSA in all testing cases. Both CSA and FM-index have searching time increasing linearly with pattern length.
• Using forward search, CSA is, however, two times faster than FM-index in the medium or the large implementation, and slightly slower than FM-index in the small implementation. Unlike backward search, forward search is not sensitive to pattern length. We believe that in practice, forward searching with CSA and FM-index requires a|P| + b log n time, for some constants a much smaller than b. In other words, the time is determined by the log n factor instead of the pattern length.
• Theoretically, backward search is better than forward search for both indexes. Most surprisingly, experiments show that for long patterns, forward search is more efficient. For CSA, forward search outperforms backward search for patterns of length 50 or more; for FM-index, this occurs for patterns of length around 3000 or more.
⁴ We have also investigated another setting in which both data structures have the same amount of auxiliary storage (precisely, 2n bits). Note that this setting is not fair to FM-index, as we allow CSA together with its auxiliary storage to use more memory; nevertheless, the findings are similar to those reported below.
⁵ As DNA is a very biased string, a random pattern is unlikely to be found in it; thus, searching for a random string is often very fast, as a few comparisons can confirm its non-existence. To test the worst-case behavior of the indexing data structures, we use substrings or modified substrings of the DNA.
Table 2: Searching performance (in msec) of CSA and FM-index for Fly.

                                            Pattern Length
Index  Size    Method     10      50      100     500     1000    5000    10000
CSA    small   forward    28.70   28.70   28.71   28.72   28.74   28.80   28.86
               backward   1.87    9.36    18.81   95.00   189.3   947.5   1894
       medium  forward    2.654   2.655   2.660   2.677   2.684   2.753   2.821
               backward   0.635   3.160   6.357   31.95   63.87   319.4   638.5
       large   forward    1.001   1.001   1.021   1.022   1.031   1.094   1.160
               backward   0.435   2.170   4.371   21.97   43.95   220.0   439.0
FMI    small   forward    20.47   20.48   20.50   20.50   20.53   20.54   20.62
               backward   0.062   0.283   0.551   2.668   5.321   26.54   52.99
       medium  forward    7.983   8.021   8.023   8.029   8.040   8.101   8.180
               backward   0.037   0.175   0.343   1.678   3.325   16.54   33.29
       large   forward    2.699   2.700   2.707   2.711   2.732   2.796   2.877
               backward   0.027   0.130   0.253   1.231   2.446   12.21   24.41
Table 3: Searching performance (in msec) of CSA and FM-index for Human.

                                            Pattern Length
Index  Size    Method     10      50      100     500     1000    5000    10000
CSA    small   forward    38.96   38.96   38.98   38.98   39.01   39.10   39.20
               backward   2.51    12.37   24.74   123.7   247.6   1235    2474
       medium  forward    3.603   3.604   3.607   3.618   3.656   3.658   3.757
               backward   0.852   4.200   8.443   42.66   85.15   425.2   851.2
       large   forward    1.359   1.359   1.360   1.362   1.370   1.454   1.545
               backward   0.584   2.910   5.802   29.17   58.45   291.5   583.2
FMI    small   forward    20.01   20.02   20.04   20.05   20.08   20.13   20.22
               backward   0.074   0.352   0.710   3.554   7.120   35.62   71.31
       medium  forward    7.805   7.805   7.806   7.808   7.811   7.819   7.829
               backward   0.044   0.208   0.412   2.051   4.108   20.55   41.15
       large   forward    2.638   2.639   2.640   2.644   2.672   2.735   2.821
               backward   0.032   0.155   0.312   1.536   3.084   15.37   30.72

Roughly speaking, for patterns of length less than 1500, FM-index with backward search is the best; otherwise, CSA with forward search is fastest.
3.2 Comparison with Suffix Trees and Suffix Arrays  As one can expect, the searching performance of suffix trees or suffix arrays should be better than that of CSA and FM-index, as there is no compression involved in the former indexes. In this section, we try to give a quantitative comparison between these four indexes in practice. We have constructed the four indexes for the E.coli (4.6 Mbases) genome. For CSA and FM-index, we consider just the medium implementations, each of which occupies 6n bits. We have tested
the searching times using patterns of lengths 10, 50, 100, 500, 1000, 5000, and 10000. Forward search is conducted for all the indexes, and backward search is conducted for CSA and FM-index. In the case where we conduct forward search, an additional 2n bits of memory is required for storing the DNA text. Patterns are selected from the E.coli genome at random positions. Each test case is repeated 1000 times to obtain an average timing. The searching times are separated into two parts: the time for reporting whether the pattern exists in the text, and the time for enumerating the location of each occurrence. Tables 4(a) and 4(b) show the best time obtained by each index. From Table 4, we observe that both CSA
Table 4: Searching performance of different indexes. (a) Average time (in msec) for one existential query for different pattern lengths. (b) Average time (in microseconds) for reporting the location of one occurrence.

(a)
                              Pattern Length
Index          10      50      100     500     1000    5000    10000
Suffix Tree    0.003   0.003   0.004   0.006   0.010   0.035   0.067
Suffix Array   0.010   0.010   0.011   0.015   0.021   0.060   0.111
CSA            0.512   2.145   2.150   2.166   2.173   2.232   2.318
FM-index       0.035   0.162   0.320   1.573   3.152   5.225   5.309

(b)
Index          Time
Suffix Tree    11.5
Suffix Array   0.5
CSA            49.0
FM-index       114.0
and FM-index are much slower than suffix trees and suffix arrays, in terms of both existential queries and enumerating occurrences. Nevertheless, in terms of absolute time, each existential query using CSA or FM-index can be answered within a few milliseconds, and the enumeration of one occurrence is in the order of microseconds, which is acceptable in most applications.
4 Construction Requirements
When considering the memory requirement for CSA and FM-index, one often focuses on the size of the resulting data structures. In fact, another important concern is the memory requirement for constructing these data structures. The original methods for constructing CSA and FM-index require building the suffix array first and then performing compression, which is infeasible for genomes whose length is several Gigabases. Nevertheless, we can overcome the memory problem by constructing the CSA and FM-index in an alternative way. Previous results show [7, 10] that CSA can be directly constructed from the DNA sequence in (5 + epsilon)n bits of space, for any epsilon > 0, while the FM-index can be converted from the CSA in negligible extra space. Note that the larger the epsilon, the faster the algorithm is. Based on this, we have implemented space-efficient programs for constructing CSA and FM-index on a PC, and tested them with E.coli, Fly and Human as the input genomes. For the actual construction, we set epsilon = 5 to make maximal use of the main memory. Table 5 shows the corresponding construction space and time for each genome. To conclude this section, Table 6 shows the limitations on the indexes that can be constructed on an ordinary PC nowadays, assuming a RAM of size 4 Gbytes.⁶
⁶ Technically speaking, we cannot make full use of all the RAM as some space must be reserved for the operating system. In our case, only a maximum of 3.6 Gbytes out of 4 Gbytes is available.

Table 5: Construction time and space for CSA and FM-index

DNA      Construction Space       Construction Time (CSA)   Construction Time (FM-index)
E.coli   5.8 Mbytes (10n bits)    60 sec                    72 sec
Fly      125 Mbytes (10n bits)    30 min                    36 min
Human    3.6 Gbytes (10n bits)    24 hours                  28 hours

Table 6: Limitations on the genome to be constructed

Index         Construction Algorithm   Maximum Genome Constructed   Maximum Genome Able To Reside
Suffix Tree   Kurtz [9]                180 Mb                       180 Mb
Suffix Array  Larsson-Sadakane [11]    450 Mb                       900 Mb
CSA           Lam et al. [10]          5000 Mb                      7200 Mb
FM-index      Hon et al. [7]           5000 Mb                      9600 Mb

5 Concluding Remarks
We have demonstrated that CSA and FM-index can be constructed for a genome of length up to a few Gigabases. This allows most of the genomes available nowadays to become indexable, which was previously infeasible using suffix trees or suffix arrays. For the searching performance, we observe that CSA is better than FM-index for searching long patterns. Moreover, we have compared backward search with forward search for both indexes, and find that forward search is faster than backward search when the input pattern is long. This is counter-intuitive, since in theory, backward search is always better than forward search.
References
[1] J. Bentley, D. Sleator, R. Tarjan, and V. Wei. A locally adaptive data compression scheme. Communications of the ACM, 29(4):320-330, 1986.
[2] M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, 1994.
[3] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194-203, 1975.
[4] P. Ferragina and G. Manzini. An experimental study of an opportunistic index. In Proceedings of Symposium on Discrete Algorithms, pages 269-278, 2001.
[5] P. Ferragina and G. Manzini. Opportunistic Data Structures with Applications. In Proceedings of Symposium on Foundations of Computer Science, pages 390-398, 2000.
[6] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. In Proceedings of Symposium on Theory of Computing, pages 397-406, 2000.
[7] W. K. Hon, T. W. Lam, K. Sadakane, and W. K. Sung. Constructing Compressed Suffix Arrays with Large Alphabets. In Proceedings of International Conference on Algorithms and Computation, 2003. To appear.
[8] G. Jacobson. Space-efficient Static Trees and Graphs. In Proceedings of Symposium on Foundations of Computer Science, pages 549-554, 1989.
[9] S. Kurtz. Reducing the Space Requirement of Suffix Trees. Software Practice and Experience, 29:1149-1171, 1999.
[10] T. W. Lam, K. Sadakane, W. K. Sung, and S. M. Yiu. A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays. In Proceedings of International Conference on Computing and Combinatorics, pages 401-410, 2002.
[11] J. Larsson and K. Sadakane. Faster Suffix Sorting. Technical Report LU-CS-TR:99-214, LUNDFD6/(NFCS-3140)/1-43/(1999), Lund University, 1999.
[12] U. Manber and G. Myers. Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, 22(5):935-948, 1993.
[13] E. M. McCreight. A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262-272, 1976.
Faster placement of hydrogens in protein structures by dynamic programming
Andrew Leaver-Fay    Yuanxin Liu    Jack Snoeyink

Abstract
Word and coauthors from the Richardsons' 3D Protein Structure laboratory at Duke University propose dot scores to measure inter-atomic interactions in molecular structures. They use these scores in their program REDUCE, which searches for the optimal placement of hydrogen atoms in molecular structures from the Protein DataBank (PDB). We investigate the accuracy and computation of these scores. By replacing part of their search by a dynamic programming algorithm based on partitioning the interaction graph between amino acid residues, we observe a speed-up of up to seven orders of magnitude.

1 Introduction
In molecular biology, it is frequently important to analyze atom contacts within a protein molecule to accurately determine and evaluate its structure. As tools for atom contact analysis, Word and others from David and Jane Richardsons' 3D Protein Structure laboratory [2, 3, 4, 5, 6] have developed a set of programs, all of which are open-source and can be used directly over the internet at http://kinemage.biochem.duke.edu. One of these programs, named REDUCE, is used to add hydrogens to a Protein DataBank (PDB) molecular structure based on atom contacts [6]. It uses a brute-force search for the hydrogen placement that minimizes an energy score function.
REDUCE's brute-force search is, in fact, reasonably fast for the majority of the input cases encountered. Unfortunately, a number of "bad" input cases noted by the computational biologists cannot be computed within acceptable time. We found a dynamic programming formulation of the optimization problem which succeeds for low-treewidth graphs, implemented our algorithm, and replaced their search subroutine with ours. We compared the modified version of REDUCE with the original, using the "bad" input cases. Our program produces identical outputs, and is faster by orders of magnitude.

2 Representations of protein structure
Biologists speaking of proteins will often say that a protein's sequence determines its structure, which determines function. This "sequence" is the sequence of amino acids, which can be identified by their names or by their three-letter abbreviations. There are 20 different amino acids which nature uses to build its proteins. Figure 1 shows two amino acids in a portion of a protein. The backbone of the protein is composed of a repeating sequence of -N-Ca-C- atoms. Each amino acid contributes one unit of this repeat. The 20 amino acids differ in the chemical groups bound to the Ca atom. A protein's sequence refers to the side chain composition, but it is the combination of the side chain and backbone chemical structure that determines its physical structure, its "fold."
A popular model for proteins is the hard-sphere molecular model: an atom is modelled as a sphere in 3D centered at the atom's coordinate and with radius equal to the atom's van der Waals radius. Spheres may overlap only for atoms that share a chemical or hydrogen bond. Biologists like this simple geometric model of proteins not only because it makes visualization easy but also because the interaction between protein atoms can be modelled reasonably well as collisions between spheres.
Figure 1: Protein backbone & side chain bonds with three van der Waals spheres
As mentioned above, the Richardsons' 3D Protein Structure laboratory has developed a set of software tools that all take a protein model as input and analyze the protein's atom interactions based on their model as spheres. On the "front end", a program named KINEMAGE visualizes a protein model and shows colored dots and spikes for atom contacts between non-
bonded atoms. The dots and spikes are produced by a program named PROBE. Finally, there is a program named REDUCE, the topic of this paper, which takes a PDB file and adds hydrogens to it. This needs some explanation. The Protein Data Bank (PDB) is a public repository at http://www.rcsb.org/pdb for all experimentally determined protein structures. A PDB file contains the coordinates of the atoms specified in angstroms and the bonds between the atoms. The most popular method for structure determination, X-ray crystallography, is based on electron density. Unfortunately, due to technology limitations, electron density is sufficiently resolved for only the heavy atoms such as C, N and O. Hydrogen atoms, though roughly half of the atoms in any given protein, are not visible in an X-ray diffraction experiment. REDUCE takes a protein model from a PDB file and finds the best placement of hydrogens onto the model. The output of REDUCE can then be used by PROBE and KINEMAGE to analyze the interactions of atoms in which hydrogen atoms participate, which the biologists have considered to be important [3, 4]. The broad goal of REDUCE, when searching for the best hydrogen placement, is to minimize "clashes" between atoms (that is, when two atoms' van der Waals spheres overlap in space) and to maximize the amount of hydrogen bonding. A hydrogen bond is formed as an overlap between a hydrogen atom and a polar heavy atom to which the hydrogen is not chemically bound. Not all hydrogen atoms can participate in hydrogen bonds, only those chemically bound to polar heavy atoms. The heavy atom not chemically bound to the hydrogen is called the acceptor. The hydrogen bond donor refers sometimes to the other, chemically bound, heavy atom, and sometimes to the hydrogen. In Appendix A, we refer to the hydrogen when we use the term hydrogen bond donor. What choices does REDUCE have to add hydrogens? The positions of hydrogen atoms bound to most heavy atoms are fixed. REDUCE computes these positions with simple vector geometry. Other hydrogens belong to OH, SH or NH3+ groups that have rotational freedom. The rotation is commonly sampled at fixed intervals to make the choices discrete. Finally, there are certain side chains (ASN, GLN, and HIS, in particular) that appear symmetric in two different "flip" positions to the structure determination technology unless hydrogens are placed on them. This is illustrated in Figure 2 and Figure 3. Unless the structural biologist depositing the protein to the PDB has paid careful attention to where the hydrogens must lie, the structure is likely not correct. By explicitly modelling the hydrogen atoms
40
for these groups, REDUCE is able to resolve structural ambiguity for heavy atoms as well. REDUCE formalizes the search for possible hydrogen placement choices as an assignment of states to a collection of movers. A mover is a group of atoms that can be rotated or "flipped" together to change the coordinates of the hydrogens in the group. Each choice of rotation or flip defines a state for a mover. The "best" hydrogen placement is defined to be the configuration of movers that minimizes an energy function. The energy function inspects all the nonbonded atom contacts—an atom contact is made whenever two atoms' van der Waals spheres overlap in space. When a contact is made by two atoms that form a donor/acceptor pair for a hydrogen bond, the overlap volume is rewarded by a negative energy score, otherwise, it is penalized by a positive energy score. Specifically,
In theory, the constants are chosen to give "an overall scoring profile similar in shape to the van der Waals function for an isolated pairwise interaction" [5]. The overlap volume is approximated with dots sampled on the surface of the overlapping atoms weighted by their penetration depth. Interestingly, this approximation does not converge to the overlap volume as the sampling density increases, even though the biologists are satisfied with its accuracy. We present a detailed analysis of this scoring function in Appendix A. To minimize the energy score, the movers are first grouped into sets, called "cliques" that are isolated in their atom contact interactions. To be more formal, we define an interaction graph with a vertex for each mover in the clique and an edge between two mover vertices if there exist some states of the movers that induce contacts between their atoms. A clique is a set of movers corresponding to the vertices of a connected component of the interaction graph. Clearly, minimizing the energy score for each clique also minimizes the total energy score. The minimization for each clique is done by brute-force enumeration of all possible combination of the movers' states.
3 Dynamic Programming
Section 2 mentioned the interaction graph that REDUCE uses to find atom groups, called cliques, so that all contact interactions occur within cliques. REDUCE discards this interaction graph when searching for a minimum score for a clique. Instead, we partition this
interaction graph and apply dynamic programming to compute the score. First, however, we need to extend the interaction graph so that it captures not only pairwise atom contact interactions but also interactions among more than two atoms. Specifically, define a hypergraph G = (V, E). Each vertex in V represents a mover. Each hyperedge in E is a subset of V; exactly which subsets are hyperedges is defined in the rest of this paragraph. We say there is a possible complete interaction between a set of movers if there exist some states for the movers such that we can choose an atom from each mover, intersect their van der Waals spheres, and discover that the intersection is nonempty. A set of vertices V' included in V is a hyperedge if and only if there is a possible complete interaction among the movers represented by V'. We should note that the collection of hyperedges, E, is in fact an abstract simplicial complex. That is, if e is a hyperedge, then any set of vertices e' included in e is also a hyperedge. We should also note that a set containing a single vertex is a hyperedge by definition.
Figure 2: From left to right: two flip states for each of three amino acids: ASN, GLN and HIS
Figure 3: ASN induces different atom contacts in two "flip" positions. Spikes (red) indicate sphere overlaps, while dot pillows (green) indicate hydrogen bonds. (Images produced by the KINEMAGE program.)
With this interaction graph, we can decompose the scoring function. Given a set of state assignments to the movers, the score is the sum of the score for each dot. The score of a dot can either be determined by its interaction with another mover or with background stationary atoms. We can redistribute the summation of dot scores to the hyperedges as follows: for each dot score, take the maximal set of movers whose van der
Waals spheres contain the dot. These movers identify a hyperedge in the interaction graph, and we distribute the dot score to that hyperedge. In the case when a dot from a mover represented by vertex v is not contained in any other mover's spheres, but is contained in some sphere from a stationary atom (and therefore still contributes a score), we distribute the dot's score to the hyperedge {v}. The score can now be described as the sum of the scores for the hyperedges. We can summarize this a little more formally as follows. Let d denote a dot, {d} the set of all dots, score(d) the score for a dot, and d -> e a shorthand for the statement that the dot d is distributed to the hyperedge e; then the total score is sum over all dots d of score(d), which equals the sum over hyperedges e in E of (sum over d -> e of score(d)).
Given a hyperedge e and an assignment of states to its vertices, we now have a definition of its score as the sum over d -> e of score(d). With the decomposition of the total score of a clique into the sum of its hyperedge scores, we are ready to decompose our optimization problem into subproblems as well. Let us first formally define the problem.
Problem: Given
• a hypergraph G = (V, E),
• an associated set of states S(v) for each vertex v in V, and
• a scoring function f_e for each hyperedge e in E, defined as a mapping f_e from the product of S(v) over v in e to the reals, that
returns a score for an assignment of states to the vertices in e,
find the assignment of states to all vertices such that the sum of the hyperedge scores is minimized.
We also introduce the following notation for convenience.
• For a set of vertices V', let S(V') denote the state space of the vertices in V' (the product of S(v) over v in V'), and let S_V' in S(V') denote a particular state vector in the state space of V'.
• Given a set of vertices V' and their state vector S_V', we let S_V'' denote the partial state vector just for a subset V'' of V'.
The way this problem can be decomposed into subproblems depends on the connectivity of the graph. Suppose we can find a vertex cut, A, of the graph. That is, the rest of the vertices, V - A, can be partitioned into two sets, L and R, such that no hyperedges exist between them (see Figure 4). Equivalently, the hyperedges can be partitioned into three sets:
• E_A, the hyperedges that are subsets of A;
• E_L, the hyperedges that are subsets of A union L but not in E_A;
• E_R, the hyperedges that are subsets of A union R but not in E_A.
Figure 4: Partition the hyperedges into three sets.
Let T(E, S_V) denote the minimum score for a collection of hyperedges E with some of its vertices V fixed in states S_V. Then, the minimum score for the graph is the minimum over S_A in S(A) of T(E_A, S_A) + T(E_L, S_A) + T(E_R, S_A).
This formula illustrates the essential idea of how the interaction graph can help us decompose and combine subproblems. We now show a recursive formula which can be implemented as a bottom-up dynamic programming algorithm. In order to make the recursive formula concise, let us first name some objects in the graph.
• V_f, a set of vertices that we want to fix in states S_Vf;
• a vertex x such that A = V_f union {x} cuts the graph into multiple subgraphs. That is, the rest of the vertices, V - A, can be partitioned into multiple sets of vertices such that no hyperedges exist between them. Equivalently, the hyperedges can be partitioned into sets of two kinds:
  - E_A, the hyperedges that are subsets of A;
  - for each vertex partition V', a set of hyperedges that are subsets of A union V' but not in E_A. This hyperedge set can be identified by a set of vertices a, a subset of A, which cuts V' from the rest of the vertices. We therefore denote the hyperedge set identified by its vertex cut a as E_xa (see Figure 5), and we denote the collection of these vertex cuts by {a_x}.
Figure 5: Identify an edge set by its vertex cut a.
Then, the minimum score for a set of hyperedges E with fixed vertex states S_Vf can be recursively defined as follows.
The recursion terminates when we have a single hyperedge; in that case the minimum score is simply the minimum of f_e over the states of the vertices of e that are not fixed by S_Vf.
The recursion can be implemented as a bottom-up dynamic programming algorithm as follows. For data structures, we keep a hypergraph G' to store solution scores to the subproblems. G' is initialized to be the interaction graph G, but will have its vertices and hyperedges removed, as well as hyperedges added, by the process. We keep the following invariant.
• A hyperedge a in G' is either a hyperedge of G or a vertex cut that identifies a portion of the graph that was removed. Furthermore, let E_xa denote the hyperedges of the removed graph. We assert that a also stores the set of solution scores for the subproblems on E_xa, namely {T(E_xa, S_a) for all S_a in S(a)}.
To initialize the process, set G' = G. At each step, the following is performed.
1. Choose a vertex x and identify the set of vertices V' which x is connected to (by hyperedges).
2. Add a new hyperedge e = V' if V' is not already a hyperedge. Then for each S_e in S(e), compute T(E_xe, S_e) and store the result at e. Note that when we apply the recursive formula to compute T(E_xe, S_e), each recursive part of the right-hand side must already have been stored on a hyperedge, according to the invariant. This step also maintains the invariant on the hyperedge e.
When we remove the last vertex x, we have a set of scores, each denoting the minimum of the total score when x is in some fixed state. Therefore, we report the minimum of these scores as the solution. We have now completed describing a dynamic programming algorithm for computing the minimum score of a clique.
Our implementation of the dynamic programming algorithm, however, is a more restricted version. We choose vertices for removal from G' which are connected to at most two other vertices in the graph. It can be shown that we can always succeed in finding such a vertex if and only if the graph has treewidth at most two. (Treewidth is a well-studied graph property; for a complete review of treewidth and algorithms on graphs of limited treewidth, please see [1].) If such a vertex cannot be found, however, our implementation applies brute-force minimization on the rest of the interaction graph.
To give a few more details of our implementation: in order to store the subproblem scores on a hyperedge, we let each vertex (or a hyperedge containing one vertex) keep a score list indexed by the vertex's states. We let each edge keep a score table indexed by the state pairs of the edge's end-point vertices. We also observe that a hyperedge score is computed by summation over hundreds of dots, a considerable overhead if the computation is to be repeated. Therefore, we pre-compute these scores and store them either on the vertex's score list or on an edge's score table.
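A compact sketch of this elimination scheme, restricted (as in our implementation) to pairwise interactions only, is given below. The data layout and names are ours for illustration; REDUCE's actual code and our C++ integration with it differ in detail.

```cpp
// Sketch of degree-<=2 vertex elimination for minimizing a sum of per-vertex
// score lists and per-edge score tables (no degree-3 hyperedges).  The caller
// is assumed to fill `states`, `self`, `edge`, and `adj` consistently.
#include <algorithm>
#include <limits>
#include <map>
#include <set>
#include <utility>
#include <vector>

using Table = std::vector<std::vector<double>>;       // table[su][sv]

struct CliqueDP {
    int n;                                             // number of movers
    std::vector<int> states;                           // states[v] = #states of v
    std::vector<std::vector<double>> self;             // self[v][s], hyperedge {v}
    std::map<std::pair<int,int>, Table> edge;          // edge[{u,v}] with u < v
    std::vector<std::set<int>> adj;                    // current adjacency

    Table& table(int u, int v) {                       // get or create zero table
        if (u > v) std::swap(u, v);
        auto it = edge.find({u, v});
        if (it == edge.end())
            it = edge.emplace(std::make_pair(u, v),
                   Table(states[u], std::vector<double>(states[v], 0.0))).first;
        return it->second;
    }
    double& at(int u, int v, int su, int sv) {         // oriented access
        if (u < v) return table(u, v)[su][sv];
        return table(v, u)[sv][su];
    }

    // Eliminate vertices of degree <= 2; returns the minimum total score.
    double minimize() {
        std::set<int> alive;
        for (int v = 0; v < n; ++v) alive.insert(v);
        double total = 0.0;
        while (!alive.empty()) {
            int x = -1;
            for (int v : alive) if (adj[v].size() <= 2) { x = v; break; }
            if (x == -1) break;                        // fall back to brute force
            std::vector<int> nb(adj[x].begin(), adj[x].end());
            if (nb.empty()) {                          // isolated: settle its score
                total += *std::min_element(self[x].begin(), self[x].end());
            } else if (nb.size() == 1) {               // fold x into neighbour u
                int u = nb[0];
                for (int su = 0; su < states[u]; ++su) {
                    double best = std::numeric_limits<double>::infinity();
                    for (int sx = 0; sx < states[x]; ++sx)
                        best = std::min(best, self[x][sx] + at(u, x, su, sx));
                    self[u][su] += best;
                }
            } else {                                   // fold x into edge {u, v}
                int u = nb[0], v = nb[1];
                for (int su = 0; su < states[u]; ++su)
                    for (int sv = 0; sv < states[v]; ++sv) {
                        double best = std::numeric_limits<double>::infinity();
                        for (int sx = 0; sx < states[x]; ++sx)
                            best = std::min(best, self[x][sx] + at(u, x, su, sx)
                                                              + at(v, x, sv, sx));
                        at(u, v, su, sv) += best;
                    }
                adj[u].insert(v);
                adj[v].insert(u);
            }
            for (int u : nb) adj[u].erase(x);
            alive.erase(x);
        }
        return total;
    }
};
```

The fallback branch mirrors the paper's statement that brute force is applied whenever no vertex of degree at most two remains, which by the treewidth argument above never happened for the graphs without degree-3 hyperedges.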
4 Experiments
We implemented the dynamic programming algorithm in C++ to integrate with the existing source for REDUCE. The program runs on an SGI platform as well as the Mac. Our goal was to make minimal modifications to the original source and we confined our substantial changes to within two subroutines: the one that performed the exhaustive search and the one that scored individual dots. We also made a small change to the implementation of one of their data structures to reduce,
which alone resulted in a six-fold performance increase. To test the performance gain, we executed our code for a small set of PDB files. For these files, the old version of REDUCE either took too long to optimize the cliques or had decided not to try. The PDB codes for these proteins were 1GCA, 1SVY, 4XIS, 1A7S, 1SBP, and 1QRR, displayed in Figure 6. We present the running time of the entire program, running on a 300 MHz SGI machine under similar processor loads, in Table 1. The network for clique 1QRR had over 1 billion possible states, and we accordingly present only a projected running time. It should be noted that while the time running the previous program was almost entirely spent optimizing these large cliques, our program spends most of its time elsewhere, and our performance gain for clique optimization alone is even better.
Table 1: Running time comparison

PDB     Old running time (t)   New (t')      log10(t/t')
1GCA    169.54 sec.            10.05 sec.    1.23
1SVY    142.67 sec.            4 sec.        1.55
4XIS    916.03 sec.            10.75 sec.    1.93
1A7S    1312.11 sec.           8.1 sec.      2.21
1SBP    4.5 hours              9.63 sec.     3.23
1QRR    6.5 years              22.26 sec.    6.95

We have also tested our program against a larger set of 100 PDB files containing 677 cliques. Of these, there are 6 that show simultaneous overlap of three movers and which require a hyperedge of degree three to be accurately modelled. With minimal effort, we can handle these cases in conjunction with the current implementation of the dynamic programming algorithm by simply scoring their three-way interaction separately after having reduced the rest of the graph. None of the cliques showed any four-way overlap. A final observation can be made about the treewidth of the interaction graphs we observed. With the exception of the few cliques containing degree-3 hyperedges mentioned above, all of the interaction graphs were of hyperedge degree and treewidth two or less. Thus our implementation of the dynamic programming algorithm, which handles exactly these interaction graphs, is quite appropriate.
The main programming challenge has been in working with the dot-based scoring function. While this function is not in principle pairwise decomposable, we know we can get away with a degree-2 hypergraph due to the very low frequency of more complex networks. The existing code that surrounds the scoring function, however, does not lend itself to being picked apart, and forces complex logic in the innermost loop.
Figure 6: The interaction graph for six cliques.

5 Future Work
There are several issues which remain at the conclusion of this paper. First, the optimization that is being computed by the original version and the current version we have implemented is not actually finding the optimal configuration for the entire molecule. When evaluating the interaction between two movers, there are two sets of dots that get scored: mover 1's dots inside mover 2 and mover 2's dots inside mover 1. When scoring a mover against the background, the mover's dots inside the background are counted, but the background's dots are not counted. This skews the optimization function to prefer good mover-mover interactions over mover-background ones in selecting the optimal state.
Second, we would like to increase the discretization level to make more fine-grained choices. After completing brute-force optimization of each clique, the old version of REDUCE performs a "local optimization" step for the rotatable movers using 1 degree rotations near the "optimal" state. While this step generally reduces the energy score further, it clearly misses the optimal solution that would be found were these fine-grained discretizations available in the brute-force optimization. With the performance gains we have achieved, it makes sense to include this local optimization in the global optimization step.
Finally, the scoring technique must be adjusted. As discussed in detail below, the volume approximation being performed in the previous and current versions of REDUCE is not accurate. The actual dot density achieved on the surface of the van der Waals spheres is not what is sought, and thus the volume approximation is significantly off. This absolutely should be corrected for the next version of REDUCE. Also, the surface integral being estimated by the dots does not converge to the actual volume of overlap, though it is the volume the biochemists sought to approximate. We would like to use this second source of error within the scoring scheme as a justification to change the way in which energy scores are evaluated, moving instead to an exact, Voronoi-like scoring scheme. Due to the high cost of the dot-based scoring function, we expect not only to see accuracy improvements, but also performance ones as well.

6 Acknowledgements
We would like to thank Professors Dave and Jane Richardson and Michael Word for their help. This research has been partially supported by NSF grant 0076984.

A Contact dots for scoring atomic interaction
Word et al. [5, 6] define contact dots and describe their use to visualize and score interactions between atoms in a protein model. Initially, between 200 and 600 dots are placed on the van der Waals surface of each atom, A, at roughly uniform density and spacing. Dots that fall inside an atom bonded to A are discarded. Each remaining dot p on A finds the closest van der Waals sphere B of an atom not bonded to A. This distance is considered negative if p is inside the van der Waals sphere B. Dot p is assigned favorable (i.e. negative) contact energy if it is outside of B, but within a small probe radius. Dot p is assigned favorable hydrogen bond energy if it is inside B (but not too deeply), and the atoms A and B are a hydrogen bond donor and acceptor. Finally, dot p is assigned an unfavorable (i.e. positive) overlap penalty if p lies inside B and either A, B is not a hydrogen bond donor/acceptor pair, or p is too deeply inside B. These contacts are easily visualized by drawing dots with favorable energies in cool blues and greens, and unfavorable overlaps with spikes of pink and red.
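The per-dot rules just summarized can be captured in a small sketch (ours). The probe radius, the "too deep" threshold, and the weights are placeholders; the actual constants are those of Word et al. [5] and are not reproduced in this paper.

```cpp
// Minimal sketch of per-dot scoring in the spirit of the rules above.
#include <cmath>

struct Vec3   { double x, y, z; };
struct Sphere { Vec3 c; double r; };    // van der Waals sphere of an atom

const double kProbeRadius   = 0.25;     // placeholder value
const double kMaxHBondDepth = 0.6;      // placeholder value
const double kContact = -1.0, kHBond = -4.0, kClash = 10.0;  // placeholder weights

// Signed gap from dot p to the surface of sphere B (negative inside B).
double gapToSphere(const Vec3& p, const Sphere& b) {
    double dx = p.x - b.c.x, dy = p.y - b.c.y, dz = p.z - b.c.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz) - b.r;
}

// Score one dot on atom A against the closest non-bonded atom B.
// hbondPair says whether A and B form a hydrogen-bond donor/acceptor pair.
double scoreDot(const Vec3& p, const Sphere& b, bool hbondPair) {
    double gap = gapToSphere(p, b);
    if (gap >= kProbeRadius) return 0.0;                      // no contact at all
    if (gap >= 0.0) return kContact * (kProbeRadius - gap);   // favorable contact
    double depth = -gap;                                      // spike length = depth/2
    if (hbondPair && depth <= kMaxHBondDepth)
        return kHBond * depth;                                // favorable hydrogen bond
    return kClash * depth;                                    // unfavorable overlap
}
```

Because each dot is scored against a single closest sphere, the function is decomposable over dots, which is exactly the property exploited by the hyperedge decomposition in Section 3.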
For applying dynamic programming in the next section, the key feature of contact dot scoring is its decomposability. Each dot can independently determine whether it is contained in a bonded atom, or what is the closest van der Waals sphere, and from that calculate its contribution to the total energy.
Contact dots are motivated in [5] as a discrete approximation to a continuous scoring function on the overlap volume, but they actually approach a function on surface area in an overlap region as the density increases. In the rest of this section, we derive both of these continuous functions and compare them to the discrete approximation. This has no effect on how we can speed up the computation in REDUCE, but this analysis may be important for extending contact dots to a more continuous scoring based on Voronoi diagrams. It also shows the amount of approximation error the biochemists are willing to tolerate for this problem.
Assume that we are given the van der Waals spheres, S1 and S2, for a pair of non-bonded atoms. Since hydrogen bond and overlap scores can be computed by the same function with a different multiplier, we look only at the overlap score for now. Let us consider the total score of a dot at (x, y) on a sphere S1 centered at the origin of radius r1 if that dot lies inside a sphere S2 centered at (d, 0) of radius r2. By rotation and translation, we can bring any pair of atoms to such a configuration, as illustrated in Figure 7.
Figure 7: Notation for two atom spheres
Word et al. [5] state that, "Hydrogen bonds and other overlaps are quantified by the volume of the overlap. Those volumes are easily measured by summing the spike length (lsp) at each dot..." In fact, this is different from the volume. As the sampling density goes to infinity, we can express this as an integral of spike length over the surface of the overlap region.
If we denote the distance from (x, y) to (d, 0) by r, then x^2 + y^2 = r1^2 and (d - x)^2 + y^2 = r^2, so x = (d^2 + r1^2 - r^2)/(2d). The spike length for a dot inside sphere S2 is half the distance to the boundary, which in our notation is lsp = (r2 - r)/2.
From geometry, the portion of sphere S1 in a halfspace X > x is a spherical cap whose surface area can be expressed as a function of r: A(r) = 2*pi*r1*(r1 - x) = (pi*r1/d)(r^2 - (d - r1)^2). The derivative is A'(r) = 2*pi*r1*r/d. Thus, the overlap calculation is the integral of lsp against this area element, namely the integral from r = d - r1 to r2 of ((r2 - r)/2)(2*pi*r1*r/d) dr.
The volume enclosed by the plane X = x and the cap of sphere S1 can be expressed as a function V_r1(x) = (pi/3)(x^3 - 3*r1^2*x + 2*r1^3). The plane X = (d^2 + r1^2 - r2^2)/(2d) contains the intersection of S1 and S2, and partitions the overlap volume into regions bounded by two spherical caps. Thus, the total overlap volume is V = V_r1((d^2 + r1^2 - r2^2)/(2d)) + V_r2((d^2 + r2^2 - r1^2)/(2d)).
We could choose to assess sphere S1 either half the total volume, V/2, the volume of its cap, V_r1((d^2 + r1^2 - r2^2)/(2d)), or the portion of volume on the side of the bisector between spheres S1 and S2 that is closest to S2, which is a more complicated formula, but is closest to the aim of the dot approximation. These three quantities are identical when the van der Waals radii are the same for the two atoms.
The surface integral and dot approximation are reasonably close to the volume score when a contact is measured from both sides, but can deviate when the contact is scored on only one side for atoms whose radii differ.
The dot scores depend on precisely how the sample dots lie relative to neighboring spheres. Even while maintaining the distance d and the penetration depth r1 + r2 - d, rotating the spheres can change the scores as different numbers of dots enter overlap configurations. We used MATLAB's fminsearch to determine the minimum and maximum scores from about 30 initial starting positions distributed on the sphere of directions.
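As a worked step (our evaluation, not reproduced from the paper), the surface-score integral above has a closed form, assuming d > r1 so that every dot of S1 inside S2 has r between d - r1 and r2:

```latex
S_1(d) \;=\; \frac{\pi r_1}{d}\int_{d-r_1}^{r_2} (r_2 - r)\, r \, dr
        \;=\; \frac{\pi r_1}{d}\left[\frac{r_2^{3}}{6}
              - \frac{r_2\,(d-r_1)^{2}}{2}
              + \frac{(d-r_1)^{3}}{3}\right].
```

Comparing this closed form with the overlap volume V above for mismatched radii is one way to quantify the deviation between the surface score and the volume score discussed in the remainder of this appendix.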
Figure 8: Overlap scores and errors as functions of the penetration depth r1 + r2 - d for a sphere of radius 1 overlapping a sphere of radius 1.4 (left) and 1.8 (right). The volume score (V) is a solid line marked with diamonds. Dashed and dotted lines are the limit of the dot scores with infinite dot density; upper and lower triangles mark the highest and lowest dot scores returned by REDUCE for these radii and penetration depths. There are three sets of dotted lines and triangles, since the score can be computed from the larger atom (e.g. S_1.8), the smaller atom (S_1), or the average of these two (S).
Figure 8 displays graphs of the overlap scores for a sphere of radius 1 overlapping a sphere of radius 1.4 (left) and 1.8 (right). It also displays the absolute and relative error of the dot and surface scores with respect to the volume score. The x axis in these plots is the depth of penetration r1 + r2 - d, which ranges from 0 to 0.6. Thus 2.4 > d > 1.8 on the left, and 2.8 > d > 2.2 on the right. In each graph, the volume score (V) is a solid line marked with diamonds. Dashed and dotted lines are the surface score, which is the limit of the dot scores as the number of dots increases without bound. Each line has pairs of upper and lower triangles, which mark the highest and lowest dot scores returned by REDUCE for these radii. For each pair of different radii, the score can be computed from the larger atom (e.g. S_1.8), the smaller atom (S_1), or the average of these two (S).
The lowest graphs show that discretization gives high relative error on the small overlaps, but since the score is small (top) and the differences are small (middle), we can guess that this is probably not significant for the evaluation of dot scores. The averaged surface score (S) tracks the volume score (V) well. The larger or smaller atom scores, which are relevant only for atoms that are scored from one side, show greater deviation from the volume score. In fact, these consistently over- or under-estimate the volume, depending on whether we compute on the side of the smaller or larger atom. If we had added the other options for volume scores (cap volume, or bisected volume), they would be outside of the dotted lines for surface scores. For example, when the radii are 1 and 1.4, the cap volumes are within +/-16-20% of V/2, and the bisected volumes are within +/-8-14% of V/2. The exact surface scores, S_1 and S_1.4, are within +/-1-8%. Thus, we did not include comparisons with these other volume scores, although they are interesting options for continuous, Voronoi-based scoring.
Figure 9 shows that the dot scores for other radii have similar behavior. It plots absolute and relative errors for approximating the volume score by the dot scores from atoms whose radii take on all pairs of values from [1, 1.17, 1.4, 1.55, 1.65, 1.75, 1.8]. Min and max error are plotted for each ordered pair of radii.
References
[1] Hans L. Bodlaender. A tourist guide through treewidth. Technical Report, Dept. Comput. Sci., Utrecht Univ., 1992.
[2] Simon C. Lovell, Ian W. Davis, W. Bryan Arendall III, Paul I. W. de Bakker, J. Michael Word, Michael G. Prisant, Jane S. Richardson, and David C. Richardson. Structure validation by C-alpha geometry: phi, psi, and C-beta deviation. Proteins: Structure, Function, and Genetics, 50:437-450, 2003.
[3] D. C. Richardson and J. S. Richardson. Mage, probe, and kinemages. In M. G. Rossmann and E. Arnold, editors, International Tables for Crystallography, volume F, pages 727-730. Kluwer Publishers, Dordrecht, 2001.
[4] J. M. Word, R. C. Bateman Jr., B. K. Presley, S. C. Lovell, and D. C. Richardson. Exploring steric constraints on protein mutations using mage/probe. Protein Sci., 9:2251-2259, 2000.
[5] J. Michael Word, Simon C. Lovell, Thomas H. LaBean, Hope C. Taylor, Michael E. Zalis, Brent K. Presley, Jane S. Richardson, and David C. Richardson. Visualizing and quantifying molecular goodness-of-fit: Small-probe contact dots with explicit hydrogen atoms. Journal of Molecular Biology, 285:1711-1733, 1999.
[6] J. Michael Word, Simon C. Lovell, Jane S. Richardson, and David C. Richardson. Asparagine and glutamine: Using hydrogen atom contacts in the choice of side-chain amide orientation. Journal of Molecular Biology, 285(4):1735-1747, 1999.
Figure 9: Absolute and relative volumes for all pairs of radii.
An Experimental Analysis of a Compact Graph Representation
Daniel K. Blandford    Guy E. Blelloch    Ian A. Kash
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
{blandford,blelloch,iak}@cs.cmu.edu
Abstract
In previous work we described a method for compactly representing graphs with small separators, and presented preliminary experimental results. In this paper we extend the experimental results in several ways, including extensions for dynamic insertion and deletion of edges, a comparison of a variety of coding schemes, and an implementation of two applications using the representation. The results show that the representation is quite effective for a wide variety of real-world graphs, including graphs from finite-element meshes, circuits, street maps, router connectivity, and web links. In addition to significantly reducing the memory requirements, our implementation of the representation is faster than standard representations for queries. The byte codes we introduce lead to DFS times that are a factor of 2.5 faster than our previous results with gamma codes and a factor of between 1 and 9 faster than adjacency lists, while using a factor of between 3 and 6 less space.
1 Introduction
We are interested in representing graphs compactly while supporting queries and updates efficiently. The goal is to store large graphs in physical memory for use with standard algorithms requiring random access. In addition to having applications to computing on large graphs (e.g. the link graph of the web, telephone call graphs, or graphs representing large meshes), the representations can be used for medium-size graphs on devices with limited memory (e.g. map graphs on a handheld device). Furthermore, even if the application is not limited by physical memory, the compact representations can be faster than standard representations because they have better cache characteristics. Our experiments confirm this.
* This work was supported in part by the National Science Foundation as part of the Aladdin Center (www.aladdin.cmu.edu) under grants CCR-0086093, CCR-0085982, and CCR-0122581.
Many methods have been proposed for compressing various specific classes of graphs. There have been many results on planar graphs and graphs with constant genus [32, 15, 13, 18,14, 21, 8, 6]. These representations can all store an n-vertex unlabeled planar graph in O(ri) bits, and some allow for 0(l)-time neighbor queries [21, 8, 6]. By unlabeled we mean that the representation is free to choose an ordering on the vertices (integer labels from 0 to n — 1). To represent a labeled graph one needs to additionally store the vertex labels. Other representations have been developed for various other classes of graphs [15, 12, 24, 30]. A problem with all these representations is that they can only be used for a limited class of graphs. In previous work we described a compact representation for graphs based on graph separators [5]. For unlabeled graphs satisfying an 0(nc), c < I separator theorem, the approach uses O(n) bits and supports neighbor or adjacency queries in 0(l)-time per edge. A property of the representation, however, is that it can be applied to any graph, and the effectiveness of compression will smoothly degrade with the quality of the separators. For random graphs, which don't have small separators (in expectation), the space requirement asymptotically matches the informational theoretical lower bound. This smooth transition is important in practice since many real-world graphs might not strictly satisfy a separator theorem, but still have good separator properties. In fact, since many graphs only come in a fixed size, it does not even make sense to talk about separators theorems, which rely on asymptotic characteristics of separators. As it turns out, most "real world" graphs do have small separators (significantly smaller than expected from a random graph). This is discussed in Section 1.1. This paper is concerned with the effectiveness of the separator-based representation in practice. We extend the previous approach to handle dynamic graphs (edge insertions and deletions) and present a more complete set of experiments, including a comparison of different
49
prefix codes, comparison on two machines with different cache characteristics, a comparison with several variants of the adjacency-list representation, and experimental results of two algorithms using the representation. Our experiments show that our representations mostly dominate standard representations in terms of both space and query times. Our dynamic representation is slower than adjacency lists for updates. In Section 2 we review our previous representation as applied to edge-separators. The representation uses a separator tree for labeling the vertices of the graph and uses difference codes to store the adjacency lists. In Section 3 we describe our implementation, including a description of the prefix codes used in this paper. In Section 4 we describe an extension of the separator-based representation that supports dynamic graphs, i.e., the insertion and deletion of edges. Our original representation only supported static graphs. The extension involves storing the bits for each representation in fixedlength blocks and linking blocks together when they overflow. A key property is that the link pointers can be kept short (one byte for our experiments). The representation also uses a cache to store recently accessed vertices as uncompressed lists. In Sections 5 and 6 we report on experiments analyzing time and space for both the static and dynamic graphs. Our comparisons are made over a wide variety of graphs including graphs taken from finite-element meshes, VLSI circuits, map graphs, graphs of router connectivity, and link graphs of the web. All the graphs are sparse. To analyze query times we measure the time for a depth-first search (DPS) over the graph. We picked this measure since it requires visiting every edge exactly once (in each direction) and since it is a common subroutine in many algorithms. For static graphs we compare our static representation to adjacency arrays. An adjacency array stores for each vertex an array of pointers to its neighbors. These arrays are concatenated into one large array with each vertex pointing to the beginning of its block. This representation takes about a factor of two less space than adjacency lists (requiring only one word for each directed edge and each vertex). For our static representation we use four codes for encoding differences: gamma codes, snip codes, nibble codes, and byte codes (only gamma codes were reported in our previous paper). The different codes present a tradeoff between time and space. Averaged over our test graphs, the static representation with byte codes uses 12.5 bits per edge, and the snip code uses 9 bits per edge. This compares with 38 bits per edge for adjacency arrays. Due to caching effects the time performance of adjacency arrays depends significantly on the ordering of the vertices. If the ver-
50
tices are ordered randomly, then our static representation with byte codes is between 2.2 and 3.5 times faster than adjacency arrays for a DPS (depending on the machine). If the vertices are ordered using the separator order we use for compression then the byte code is between .95 and 1.3 times faster than adjacency arrays. For dynamic graphs we compare our dynamic representation to optimized implementation of adjacency lists. The performance of the dynamic separator-based representation depends on the size of blocks used for storing the data. We present results for two settings, one optimized for space and the other for time. The representation optimized for space uses 11.6 bits per edge and the one optimized for time uses 18.8 bits per edge (averaged over all graphs). This compares with 76 bits per edge for adjacency lists. As with adjacency arrays, the time performance of adjacency lists depends significantly on the ordering of the vertices. Furthermore for adjacency lists the performance also depends significantly on the order in which edges are inserted (i.e., whether adjacent edges end up on the same cache line). The runtime of the separator-based representation does not depend on insertion order. It is hard to summarize the time results other than to say that the performance of our time optimized representation ranges from .9 to 8 times faster than adjacency lists for a DPS. The .9 is for separator ordering, linear insertion, and on the machine with a large cache-line size. The 8 is for random ordering and random insertion. The time for insertion on the separator-based representation is up to 4 times slower than adjacency lists. In Section 7 we describe experimental results analyzing the performance of two algorithms. The first is a maximum-bipartite-matching algorithm and the second is an implementation of the Google page-rank algorithm. In both algorithms the graph is used many times over so it pays to use a static representation. We compare our static representation (using nibble codes) with both adjacency arrays and adjacency lists. For both algorithms our representation runs about as fast or faster, and saves a factor of between 3 and 4 in space. All experiments run within physical memory so our speedup has nothing to do with disk access. 1.1 Real-world graphs have good separators An edge-separator is a set of edges that, when removed, partitions a graph into two almost equal sized parts (see [23] for various definitions of "almost equal"). Similarly a vertex separator is a set of vertices that when removed (along with its incident edges) partitions a graph into two almost equal parts. The minimum edge (vertex) separator for a graph is the separator
that minimizes the number of edges (vertices) removed. Informally we say that a graph has good separators if it and its subgraphs have minimum separators that are significantly better than expected for a random graph of its size. Having good separators indicates that the graph has some form of locality—edges are more likely to attach "near" vertices than far vertices. Along with sparsity, having good separators is probably the most universal property of real-world graphs. The separator property of graphs has been used for many purposes, including VLSI layout [2], nested dissection for solving linear systems [16], partitioning graphs on to parallel processors [27], clustering [29], and computer vision [26]. Although finding a minimum separator for a graph in NP-hard, there are many algorithms and codes that find good approximations [23]. Here we briefly review why graphs have good separators. Many graphs have good separators because they are based on communities and hence have a local structure to them. Link graphs for the web have good separators since most links are either within a local domain or within some other form of community (e.g. computer science researchers, information on gardening, ...). This is not just true at one level (i.e., either local or not), but is true hierarchically. Most graphs based on social networks have similar properties. Such graphs include citation graphs, phone-call graphs, and graphs based on friendship-relations. In fact Watts and Strogatz [36] conjecture that locality is one of the main properties of graphs based on social networks. Many graphs have good separators because they are embedded in a low dimensional space. Most meshes that are used for various forms of simulation (e.g. finite element meshes) are embedded in two or three dimensional space. Two dimensional meshes are often planar (although not always) and hence satisfy an O(nJ/2) vertexseparator theorem [17]. Well shaped three dimensional meshes are known to satisfy an O(n2/3) vertex-separator theorem [20]. Graphs representing maps (roads, powerlines, pipes, the Internet) are embedded in a little more than two dimensions. Road maps are very close to planar, except in Pittsburgh. Power-line graphs and Internet graphs can have many crossings, but still have very good separators. Graphs representing the connectivity of VLSI circuits also have a lot of locality since ultimately they have to be laid out in two dimensions with only a small constant number of layers of connections. It is well understood that the size of the layout depends critically on the separator sizes [33]. Clearly certain graphs do not have good separators. Expander graphs by their very definition cannot have small separators.
2
Encoding with Separators
In previous work [5] we described an O(n)-bit encoding with O(l) access time for graphs satisfying either an nc, c < 1 edge or vertex separator theorem. In this paper we only consider the simpler version based on edge separators. Here we review the algorithm for edgeseparators. Edge Separators. Let 5 be a class of graphs that is closed under the subgraph relation. We say that S satisfies a f(n)-edge separator theorem if there are constants a < 1 and f3 > 0 such that every graph in S with n vertices has a set of at most /3/(n) edges whose removal separates the graph into components with at most an vertices each [17]. Given a graph G it is possible to build a separator tree. Each node of the tree contains a subgraph of G and a separator for that subgraph. The children of a node contain the two components of the graph induced by the separator. The leaves of the tree are single nodes. Without loss of generality we will consider only graphs in which all vertices have nonzero degree. We will also assume the existence of a graph separator algorithm that returns a separator within the O(nc) bound. Adjacency Tables. Our data structures make use of an encoding in which we store the neighbors for each vertex in a difference-encoded adjacency list. We assume the vertices have integer labels. If a vertex v has neighbors vi, V2, i>3, • • • , vj. in sorted order, then the data structure encodes the differences v\ — v, V2 — fi, V3 — v-z, . . . , Vd — Vd-\ contiguously in memory as a sequence of bits. The differences are encoded using any logarithmic code, that is, a prefix code which uses O(log d) bits to encode a difference of size d. The value v\ — v might be negative, so we store a sign bit for that value. At the start of each encoded list we also store a code for the number of entries in the list. We form an adjacency table by concatenating the adjacency lists together in the order of the vertex labels. To access the adjacency list for a particular vertex we need to know its starting location. Finding these locations efficiently (in time and space) requires a separate indexing structure [5]. For the experiments in this paper we use an indexing structure (see Section 3) that is not theoretically optimal, but works well in practice and is motivated by the theoretically optimal solutions. Graph Reordering. Our compression algorithm works as follows: 1. Generate an edge separator tree for the graph. 2. Label the vertices in-order across the leaves.
51
3. Use an adjacency table to represent the relabeled graph. LEMMA 2.1. [5] For a class of graphs satisfying an nc-edge separator theorem, and labelings based on the separator tree satisfying the bounds of the separator theorem, the adjacency table for any n-vertex member requires O(n) bits. 3
Implementation Separator trees. There are several ways to compute a separator tree from a graph, depending on the separator algorithm used. In our previous paper we tested three separator algorithms and described a "child-flipping" postprocessing heuristic which could be used to improve their performance. Here we use the "bottom-up" separator algorithm with child-flipping. This algorithm gave the best performance on many of our test graphs while being significantly faster than the runner-up for performance. Generating the separator tree and labeling with this algorithm seems to take linear time and takes about 15 times as long as a depthfirst search on the same graph. The bottom-up algorithm begins with the complete graph and repeatedly collapses edges until a single vertex remains. There are many heuristics that can be used to decide in what order to collapse the edges. After some experimentation, we settled on the priority metric w AB ^I\ i n\ > where W(EAB) is the number of edges between S\s\)S\t>) the multivertices A and £?, and s(A) is the number of original vertices contained in multivertex A. The resulting process of collapsing edges creates a separator tree, in which every two merged vertices become the children of the resulting multivertex. We do not know of any theoretical bounds on this or similar separator algorithms. There is a certain degree of freedom in the way we construct a separator tree: when we partition a graph, we can arbitrarily decide which side of the partition will become the left or right child in the tree. To take advantage of this degree of freedom we use an optimization called "child-flipping". A child-flipping algorithm traverses the separator tree, keeping track of the nodes containing vertices which appear before and after the current node in the numbering. (These nodes correspond to the left child of the current node's left ancestor and the right child of the current node's right ancestor.) If those nodes are NL and NR, the current node's children are N\ and N%, and EAB denotes the number of edges between the vertices in two nodes, then our child-flipping heuristic rotates N\ and 7V2 to ensure that ENLNl+EN^NR > ENLN2+ENlNR. This heuristic can be applied to any separator tree as a postprocessing
52
phase. Indexing structure. Our static algorithms use an indexing structure to map the number of a vertex to the bit position of the start of the appropriate adjacency list. In our previous paper we tested several types of indexing structure and demonstrated a tradeoff between space used and lookup speed. Here we use a new structure called semi-direct-16 which stores the start locations for sixteen vertices in five 32-bit words. The first word contains the offset to vertex 0—that is, the first of the sixteen vertices being represented. The second word contains three ten-bit offsets from the first vertex to starts of vertices 4, 8, and 12. The next three words contain twelve eight-bit offsets to the remaining twelve vertices. Each of the twelve vertices is stored by an offset relative to one of the four vertices already encoded. For example, the start of vertex 14 is encoded by its offset from the start of vertex 12. If at any point the offsets do not fit in the space provided, they are stored elsewhere, and the table contains a pointer to them. This indexing method saves about six bits per vertex over our previous semidirect index while causing almost no slowdown. Codes and Decoding. We considered several logarithmic codes for use in our representations. In addition to the gamma code [9], which we used in our previous experiments, we implemented byte codes, nibble codes, and snip codes—three closely related codes of our own devising. Gamma codes store an integer d using a unary code for flogd] followed by a binary code for d-2 riogd l. This uses a total of 1 + 2 [log d\ bits. Assuming the machine word size is at least logd bits, gamma codes can be decoded in constant time using table lookup. Decoding the gamma codes is the bottleneck in making queries. To reduce the overhead we devised three codes, snip, nibble, and byte codes, that better take advantage of the fact that machines are optimized to manipulate bytes and words rather than extract arbitrary bit sequences. These codes are special 2, 4-, and 8-bit versions of a more general fc-bit code which encodes integers as a sequence of fc-bit blocks. We describe the fc-bit version. Each block starts with a continue bit which specifies whether there is another block in the code. An integer i is encoded by checking if is less or equal to 2 fc ~ 1 . If so a single block is created with a 0 in the continue bit and the binary representation for i — 1 in the other k — 1 bits. If not, the first block is created with a 1 in the continue-bit and the binary representation for (i — 1) mod 2 fc ~" 1 in the remaining bits (the mod is implemented with a bitwise and). This block is then followed by the code
for [(i — l)/2 fc 1J (the / is implemented with a bitwise shift). The 8-bit version (byte code) is particularly fast to encode and decode since all memory accesses are to bytes. The 4-bit version (nibble code) and 2-bit version (snip code) are somewhat slower since they require more accesses and require extracting pieces of a byte. We also considered using Huffman and Arithmetic codes which are based on the particular distribution at hand. However, we had used these codes in previous work [3] and found that although they save a little space over gamma codes (about 1 bit per edge for arithmetic codes), they are more expensive to encode and decode. Since our primary goal was to improve time1 performance, we did not implement these codes. 4 Dynamic Representation Here we present a data structure that permits dynamic insertion (and deletion) of edges in the graph. In the static data structure, the data for each vertex is concatenated and stored in one chunk of memory, with a separate index to allow finding the start of each vertex. In the dynamic data structure, the size of a vertex can change with each update, so it is necessary to dynamically assign memory to vertices. Our dynamic structure manages memory in blocks of fixed size. The data structure initially contains an array with one memory block for each vertex. If additional memory is needed to store the data for a vertex, the vertex is assigned additional blocks, allocated from a pool of spare memory blocks. The blocks are connected into a linked list. When we allocate an additional block for a vertex, we use part of the previous block to store a pointer to the new one. We use a hashing technique to reduce the size of these pointers to only 8 bits. To work efficiently the technique requires that a constant fraction of the blocks remain empty. This requires a hash function that maps (address, i) pairs to addresses in the spare memory pool. Our representation tests values of i in the range 0 to 127 until the result of the hash is an unused block. It then uses that value of i as the pointer to the block. Under certain assumptions about the hash function, if the memory pool is at most 80% full, then the probability that this technique will fail is at most .80128 ~ 4 * 10-13. To help ensure memory locality, a separate pool of contiguous memory blocks is allocated for each 1024 vertices of the graph. If a given pool runs out of memory, it is resized. Since the pools of memory blocks are fairly small this resizing is relatively efficient. Caching. For graph operations that have high locality, such as repeated insertions to the same vertex, it may be inefficient to repeatedly encode and decode
Graph auto feocean m!4b ibml? ibmlS CA PA googlel googleO lucent scan
Vtxs 448695 143437 214765 185495 210613 1971281 1090920 916428 916428 112969 228298
Max Source Edges Degree 3D mesh [35] 6629222 37 6 3D mesh [35] 819186 40 3D mesh [35] 3358036 circuit [1] 4471432 150 circuit [1] 4443720 173 street map [34] 12 5533214 street map [34] 3083796 9 5105039 6326 web links [10] web links [10] 5105039 456 routers [25] 423 363278 routers [25] 640336 1937
Table 1: Properties of the graphs used in our experiments. the neighbors of a vertex. We implemented a variant of our structure that uses caching to improve access times. When a vertex is queried, its neighbors are decoded and stored in a temporary adjacency list structure. Memory for this structure is drawn from a separate pool of list nodes of limited size. The pool is managed in first in first out mode. A modified vertex that is flushed from the pool is written back to the main data structure in compressed form. We maintain the uncompressed adjacency lists in sorted order (by neighbor label) to facilitate writing them back. 5
Experimental Setup Graphs. We drew test graphs for our experiments from several sources: 3D Mesh graphs from the online Graph Partitioning Archive [35], street connectivity graphs from the Census Bureau Tiger/Line data [34, 28], graphs of router connectivity from the SCAN project [25], graphs of webpage connectivity from the Google [10] programming contest data, and circuit graphs from the ISPD98 Circuit Benchmark Suite [1]. The circuit graphs were initially hypergraphs; we converted them to standard graphs by converting each net into a clique. Properties of these graphs are shown in Table 1. For edges we list the number of directed edges in the graph. For the directed graphs (googlel and googleO) we take the degree of a vertex to be the number of elements in its adjacency list. Machines and compiler. The experiments were run on two machines, each with 32-bit processors but with quite different memory systems. The first uses a .7GHz Pentium III processor with .iGHz frontside bus and 1GB of RAM. The second uses a 2.4GHz Pentium 4 processor with .8GHz frontside bus and 1GB of RAM. The Pentium III has a cache-line size of 32 bytes, while the Pentium 4 has an effective cache-line size of 128
53
bytes. The Pentium 4 also supports quadruple loads and hardware prefetching, which are very effective for loading consecutive blocks from memory, but not very useful for random access. The Pentium 4 therefore performs much better on the experiments with strong spacial locality (even more than the factor of 3.4 in processor speed would indicate), but not particularly well on the experiments without spacial locality. All code is written in C and C++ and compiled using g++ (3.2.3) using Linux 7.1. Benchmarks. We present times for depth-firstsearch as well as times for reading and inserting all edges. We select a DPS since it visits every edge once, and visits them in a non-trivial order exposing caching issues better than simply reading the edges for each vertex in linear order. Our implementation of DPS uses a character array of length n to mark the visited vertices, and a stack to store the vertices to return to. It does nothing other than traverse the graph. For reading the edges we present times both for accessing the vertices in linear order and for accessing them in random order. In both cases the edges within a vertex are read in linear order. For inserting we insert in three different orders: linear, transpose, and random. Linear insertion inserts all the out-edges for the first vertex, then the second, etc.. Transpose insertion inserts all the in-edges for the first vertex, then the second, etc.. Note that an in-edge (z, j) for vertex j goes into the adjacency list of vertex i not j. Random insertion inserts the edges in random order. We compare the performance of our data structure to that of standard linked-list and array-based data structures, and to the LED A [19] package. Since small differences in the implementation can make significant differences in performance, here we describe important details of these implementations. Adjacency lists. We use a singly linked-list data structure. The data structure uses a vertex-array of length n to access the lists. Each array element i contains the degree of vertex i and a pointer to a linked list of the out-neighbors of vertex i. Each link in the list contains two words: an integer index for the neighbor and a pointer for the next link. We use our own memory management for the links using free lists—no space is wasted for header or tail words. The space required is therefore 2n + 2m + 0(1) words (32 bits each for the machines we used). Assuming no deletions, sequential allocation returns consecutive locations in memory— this is important for understanding spacial locality. In our experiments we measured DPS runtimes after inserting the edges in three orders: linear, transpose, and random. These insertion orders are describe above. The insertion orders have a major affect on the runtime
54
for accessing the linked lists—the times for DPS vary by up to a factor of 11 due to the insertion order. For linear insertion all the links for a given vertex will be in adjacent physical memory locations giving a high degree of spacial locality. This means when an adjacency list is traversed most of the links will be found in the cache— they are likely to reside on the same cache line as the previous link. This is especially true for our experiments on the Pentium 4 which has 128-byte cache lines (each cache line can fit 16 links). For random insertion, and assuming the graph does not fit in cache, accessing every link is likely to be a cache miss since memory is being accessed in completely random order. We also measured runtimes with the vertices labeled in two orders: randomized and separator. In the randomized labeling the integer labels are assigned randomly. In the separator labeling we use the labeling generated by our graph separator—the same as used by our compression technique. The separator labeling gives better spacial locality in accessing both the vertexarray and the visited-array during a DFS. This is because loading the data for a vertex will load the data for nearby vertices which are on the same cache-line. Following an edge to a neighbor is then likely to access a vertex nearby in the ordering and still in cache. If linear insertion is used the separator labeling also improves locality on accessing the links during a DFS. This is because the links for neighboring vertices will often fall on the same cache lines. We were actually surprised at what a strong effect labeling based on separators had on performance. The performance varied by up to a factor of 7 for the graphs with low degree and the machine with 128-byte cache lines. Adjacency Array. The adjacency array data structure is a static representation. It stores the outedges of each vertex in an edge-array, with one integer per edge (the index of the out neighbor). The edgearrays for the vertices are stored one after the other in the order of the vertices. A separate vertex-array points to the start of the edge-array for each vertex. The number of out-edges of vertex i can be determined by taking the difference of the pointer to the edge array for vertex i and the edge array for vertex i + 1. The total space required for an adjacency array i s n + ra + 0(l) words. For static representations it makes no sense to talk about different insertion orders of the edges. The ordering of the vertex labeling, however, can make a significant difference in performance. As with the linkedlist data-structure we measured runtimes with the vertices labeled in randomized and separator order. Also as with linked lists, using the separator ordering improved performance significantly, again by up to a factor of 7.
Graph auto feocean m!4b ibml? ibmlS CA PA lucent scan googlel googleO Avg
Rand Ti 0.268s 0.048s 0.103s 0.095s 0.113s 0.920s 0.487s 0.030s 0.067s 0.367s 0.363s
Array Sep T/Ti 0.313 0.312 0.388 0.536 0.398 0.126 0.137 0.266 0.208 0.226 0.250 0.287
Space 34.17 37.60 34.05 33.33 33.52 43.40 43.32 41.95 43.41 37.74 37.74 38.202
Byte T/Ti Space 0.294 10.25 0.312 12.79 0.349 10.01 0.536 10.19 0.442 10.24 14.77 0.146 0.156 14.76 0.3 14.53 0.253 15.46 0.258 11.93 0.278 12.59 0.302 12.501
Nibble Space T/Ti 7.42 0.585 0.604 10.86 0.728 7.10 7.72 1.115 0.867 7.53 0.243 10.65 0.258 10.65 0.5 11.05 0.402 11.84 0.405 8.39 9.72 0.460 0.561 9.357
Our Structure Snip Space T/Ti 0.776 6.99 11.12 0.791 0.970 6.55 7.58 1.400 7.18 1.070 0.293 10.55 10.60 0.310 0.566 10.79 0.477 11.61 0.452 7.37 0.556 9.43 0.696 9.07
Gamma Space T/Ti 7.18 1.063 11.97 1.0 1.320 6.68 7.70 1.968 7.17 1.469 0.333 11.25 0.355 11.28 0.700 11.48 0.552 12.14 0.539 7.19 0.702 9.63 0.909 9.424
DiffByte Space 0.399 12.33 0.374 13.28 0.504 11.97 0.747 12.85 0.548 12.16 0.167 14.81 0.178 14.80 0.333 14.96 0.298 16.46 0.302 13.39 0.327 13.28 0.380 13.662
T/T!
Table 2: Performance of our static algorithms compared to performance of an adjacency array representation. Space is in bits per edge; time is for a DPS, normalized to the first column, which is given in seconds.
LEDA. We also ran all our experiments using LED A [19] version 4.4.1. Our experiments use the LEDA graph object and use the forall_outedges and forall_vertices for the loops over edges and vertices. All code was compiled with the flag LEDA_CHECKING_OFF. For analyzing the space for the LEDA data structure we use the formula from the LEDA book [19, page 281]: 52n + 44m + O(l) bytes. We note that comparing space and time to LEDA is not really fair since LEDA has many more features than our data structures. For example the directed graph data structure in LEDA stores a linked list of both the inedges and out-edges for each vertex. Our data structures only store the out-edges. LEDA also stores the edges in a doubly-linked list allowing traversal in either direction and a simpler deletion of edges6 Experimental Results Our experiments measure the tradeoffs of various parameters in our data structures. This includes the type of prefix code used in both the static and dynamic cases, and the block size used and the use of caching in the dynamic case. We also study a version that difference encodes out-edges relative to the source vertex rather than the previous out-edge. This can be used where the user needs control of the ordering of the out-edges. We make use of this in a compact representation of simplicial meshes [4]. 6.1 Static representations Table 2 presents results comparing space and DPS times for the static representations for all the graphs on the Pentium 4. Tables 5 and 6 present summary results for a wider set of operations on both the Pentium III and Pentium 4. In Ta-
ble 2 all times are normalized to the first column, which is given in seconds. The average times in the bottom row are averages of the normalized times, so the large graphs are not weighted more heavily. All times are for aDFS. For the adjacency-array representation times are given for the vertices ordered both randomly (Rand) and using our separator ordering (Sep). As can be seen the ordering can affect performance by up to a factor of 8 for the graphs with low average degree (i.e., PA and CA), and a factor of 3.5 averaged over all the graphs. This indicates that the ordering generated by graph separation is not only useful for compression, but is also critical for performance on standard representations (we will see an even more pronounced effect with adjacency lists). The advantage of using separator orders to enhance spacial locality has been previously studied for use in sparse-matrix vector multiply [31, 11], but not well studied for other graph algorithms. For adjacency arrays the ordering does not affect space. For our static representation times and space are given for four different prefix codes: Byte, Nibble, Snip and Gamma. The results show that byte codes are significantly faster than the other codes (almost twice as fast as the next fastest code). This is not surprising given that the byte codes take advantage of the byte instructions of the machine. The difference is not as large on the Pentium III (a factor of 1.45). It should be noted that the Gamma codes are almost never better than Snip codes in terms of time or space. We also include results for the DiffByte code, a version of our byte code that encodes each edge as the difference between the target and source, rather than the difference between the target and previous target.
55
4
3
Graph auto feocean m!4b ibm!7 ibmlS CA PA lucent scan googlel googleO Avg
2i 0.318s 0.044s 0.146s 0.285s 0.236s 0.212s 0.119s 0.018s 0.034s 0.230s 0.278s
Space 11.60 14.66 11.11 12.95 12.41 10.62 10.69 13.67 15.23 11.91 13.62 12.58
T/Ti 0.874 0.863 0.876 0.849 0.847 0.943 0.941 0.888 0.941 0.895 0.863 0.889
12
8
Space 10.51 13.79 10.07 11.59 11.14 12.42 12.41 14.79 16.86 12.04 13.28 12.62
T/Tx 0.723 0.704 0.684 0.614 0.635 0.952 0.949 0.833 0.852 0.752 0.694 0.763
Space 9.86 12.97 9.41 10.44 10.12 23.52 23.35 22.55 26.39 15.71 15.65 16.36
T/Ti 0.613 0.681 0.630 0.529 0.563 1.0 1.0 0.833 0.852 0.730 0.658 0.735
16
Space 10.36 17.25 10.00 10.53 10.36 35.10 34.85 31.64 37.06 20.53 19.52 21.56
T/Ti 0.540 0.727 0.554 0.491 0.521 1.018 1.025 0.833 0.852 0.730 0.640 0.721
Space 9.35 22.94 8.92 10.95 10.97 46.68 46.35 41.22 48.08 25.78 24.24 26.86
r/Ti
0.534 0.750 0.554 0.459 0.5 1.066 1.058 0.888 0.882 0.726 0.676 0.736
20
Space 11.07 28.63 10.46 11.39 11.64 58.26 57.85 51.09 59.34 31.21 29.66 32.78
Table 3: Performance of our dynamic algorithm using nibble codes with various block sizes. For each size we give the space needed in bits per edge (assuming enough blocks to leave the secondary hash table 80% full) and the time needed to perform a DPS. Times are normalized to the first column, which is given in seconds.
This increases the space since the differences are larger and require more bits to encode. Furthermore each difference requires a sign bit. It increases time both since there are more bits to decode, and because the sign bits need to be extracted. Overall these effects worsens the space bound by an average of 10% and the time bound by an average of 25%. Comparing adjacency arrays with the separator structures we see that the separator-based representation with byte codes is a factor of 3.3 faster than adjacency arrays with random ordering but about 5% slower for the separator ordering. On the Pentium III the byte codes are always faster, by factors of 2.2 (.729/.330) and 1.3 (.429/.330) respectively (see Table 6). The compressed format of the byte codes means that they require less memory throughput than for adjacency arrays. This is what gives the byte codes an advantage on the Pentium III since more neighbors get loaded on each cache line requiring fewer main-memory accesses. On the Pentium 4 the effective cache-line size and memory throughput is large enough that the advantage is reduced. Table 5, later in the section, describes the time cost of simply reading all the edges in a graph (without the effect of cache locality). 6.2 Dynamic representations A key parameter for the dynamic representation is selecting the block size. Large blocks are inefficient since they contain unused space; small blocks can be inefficient since they require proportionally more space for pointers to other blocks. In addition, there is a time cost for traversing from one block to the next. This cost includes both the time for computing the hash pointer and the potential time for
56
a cache miss. Because of this larger blocks are almost always faster. Table 3 presents the time and space for a range of block sizes. The results are based on nibble-codes on the Pentium 4 processor. The results for the other codes and the Pentium III are qualitatively the same, although the time on the Pentium III is less sensitive to the block size. For all space reported in this section we size the backup memory so that it is 80% full, and include the 20% unused memory in the reported space. As should be expected, for the graphs with high degree the larger block sizes are more efficient while for the graphs with smaller degree the smaller block sizes are more efficient. It would not be hard to dynamically decide on a block size based on the average degree of the graph (the size of the backup memory needs to grow dynamically anyway). Also note that there is a timespace tradeoff and depending on whether time or space is more important a user might want to use larger blocks (for time) or smaller blocks (for space). Table 4 presents results comparing space and DFS times for the dynamic representations for all the graphs on the Pentium 4. Tables 5 and 6 give summary results for a wider set of operations on both the Pentium III and Pentium 4. Table 3 gives six timings for linked lists corresponding to the two labeling orders and for each labeling, the three insertion orders. The space for all these orders is the same. The table also gives space and time for two settings of our dynamic data structure: Time Opt and Space Opt. Time Opt uses byte codes and is based on a block size that optimizes time.1 Space Opt uses the 1
We actually pick a setting that optimizes T3S where T is time
Graph auto feocean m!4b ibral? ibmlS CA PA lucent scan googlel googleO Avg
Random Vtx Order Lin Rand Trans r/7\ T/Ti 7i 1.160s 0.512 0.260 0.136s 0.617 0.389 0.565s 0.442 0.215 0.735s 0.571 0.152 0.730s 0.524 0.179 1.240s 0.770 0.705 0.660s 0.780 0.701 0.063s 0.634 0.492 0.117s 0.735 0.555 0.975s 0.615 0.376 0.960s 0.651 0.398 0.623 0.402
Linked List Sep Vtx Order Lin Rand Trans T/Ti T/Ti T/Ti 0.862 0.196 0.093 0.147 0.176 0.801 0.884 0.184 0.090 0.904 0.357 0.091 0.890 0.276 0.080 0.107 0.101 0.616 0.109 0.625 0.112 0.142 0.730 0.190 0.128 0.700 0.188 0.774 0.164 0.096 0.108 0.786 0.162 0.108 0.779 0.192
Space 68.33 75.21 68.09 66.66 67.03 86.80 86.64 83.90 86.82 75.49 75.49 76.405
Our Structure Space Opt Time Opt Block Time Block Time Size T/Ti Space Size T/T! Space 0.087 16 0.148 9.35 20 13.31 0.227 0.117 12.97 10 8 14.71 8.92 0.086 16 0.143 20 13.53 12 0.205 14.52 20 0.118 10.53 10 0.190 14.97 20 0.108 10.13 3 0.170 10.62 5 0.108 15.65 15.64 3 0.180 10.69 5 0.115 0.174 0.285 13.67 6 3 20.49 0.290 15.23 0.170 8 3 28.19 4 0.211 12.04 0.125 16 28.78 13.54 5 0.231 0.123 26.61 16 0.207 11.608 0.121 18.763
Table 4: The performance of our dynamic algorithms compared to linked lists. For each graph we give the spaceand time-optimal block size. Space is in bits per edge; time is for a DFS, normalized to the first column, which is given in seconds.
try is pseudo-random within the group, the location of the backup blocks has little effect on performance. In fact our experiments (not shown) showed no noticeable effect on DFS times for different insertion orders. Overall the space optimal dynamic implementation is about a factor of 6.6 more compact than adjacency lists, while still being significantly faster than linked lists in most cases (up to a factor of 7 faster for randomly inserted edges). On the Pentium 4 linked lists with linear insertion and separator ordering take about 50% less time than our space optimal dynamic representation and 10% less time than our time optimal dynamic representation. On the Pentium III linked lists with linear insertion and separator ordering take about a factor of 1.2 more time than our space optimal dynamic representation and 1.7 more time than our time optimal dynamic representation. Times for insertion are reported below. Summary. Tables 5 and 6 summarize the time complexity of various operations using the data structures we have discussed. For each structure we list the time required for a DFS, the time required to read all the neighbors of each vertex (examining vertices in linear or random order), the time required to search each vertex v for a neighbor v + 1, and the time required to construct the graph by linear, random, or transpose insertion. All times are normalized to the time required for a DFS on an adjacency list with random labeling, and the normalized times are averaged over all graphs in our dataset. and £>' is space. This is because the time gains for larger blocks List refers to adjacency lists. LEDA refers to the become vanishingly small and can be at a large cost in regards to LED A implementation. For List, LEDA and Array, 3 more space efficient nibble codes and is based on a block size that optimizes space. As with the adjacency-array representation, the vertex label ordering can have a large effect on performance for adjacency-lists, up to a factor of 7. In addition to the label ordering, the insertion ordering can also make a large difference in performance for adjacency-lists. The insertion order can cause up to a factor of 11 difference in performance for the graphs with high average degree (e.g. auto, ibm.17 and ibmlS) and a factor of 7.5 averaged over all the graphs (assuming the vertices are labeled with the separator ordering). The effect of insertion order has been previously reported (e.g. [19, page 268] and [7]) but the magnitude of the difference was surprising to us—the largest factor we have previously seen reported is about 4. We note that the magnitude is significantly less on the Pentium III with its smaller cache-line size (an average factor of 2.5 instead of 7.5). The actual insertion order will of course depend on the application, but it indicates that selecting a good insertion order is critical. We note, however, that if a user can insert in linear order, then they are better off using one of the static representations, which allow insertion in linear order. For our data structure the insertion order does not have any significant effect on performance. This is because the layout in memory is mostly independent of the insertion order. The only order dependence is due to hash collisions for the secondary blocks. Since each hash
space. For space optimal we optimize TS .
57
Graph ListRand ListOrdr LEDARand LEDAOrdr DynSpace DynTime CachedSpace CachedTime ArrayRand ArrayOrdr Byte Nibble Snip Gamma
DPS 1.000 0.322 2.453 1.119 0.633 0.367 0.622 0.368 0.945 0.263 0.279 0.513 0.635 0.825
Read Linear Random 0.744 0.099 0.740 0.096 2.876 1.855 2.268 0.478 0.933 0.440 0.650 0.233 0.935 0.431 0.690 0.240 0.638 0.095 0.641 0.092 0.197 0.693 0.873 0.399 1.044 0.562 1.188 0.710
Find Next 0.121 0.119 2.062 0.519 0.324 0.222 0.324 0.246 0.092 0.092 0.205 0.340 0.447 0.521
Linear 0.571 0.711 16.802 7.570 14.666 9.725 2.433 2.234 — — — — — —
Insert Random Transpose 28.274 3.589 0.864 28.318 16.877 21.808 7.657 20.780 23.901 15.538 15.607 10.183 28.660 8.975 19.849 6.600 — — — — — — — — — — — —
Space 76.405 76.405 432.636 432.636 11.608 18.763 13.34 19.073 38.202 38.202 12.501 9.357 9.07 9.424
Table 5: Summary of space and normalized times for various operations on the Pentium 4.
Graph ListRand ListOrdr LEDARand LEDAOrdr DynSpace DynTime CachedSpace CachedTime ArrayRand ArrayOrdr Byte Nibble Snip Gamma
DPS 1.000 0.710 3.163 2.751 0.626 0.422 0.614 0.430 0.729 0.429 0.330 0.488 0.684 0.854
Linear 0.631 0.626 2.649 2.168 0.503 0.342 0.498 0.355 0.319 0.319 0.262 0.411 0.625 0.764
Read Random 0.995 0.977 3.038 2.878 0.715 0.531 0.723 0.558 0.643 0.639 0.501 0.646 0.856 1.016
Find Next 0.508 0.516 2.518 1.726 0.433 0.335 0.429 0.360 0.298 0.302 0.280 0.387 0.538 0.640
Linear 1.609 1.551 17.543 11.846 17.791 13.415 2.616 2.597 — — — — — —
Insert Random Transpose 17.719 3.391 17.837 1.632 19.342 17.880 19.365 11.783 22.520 18.423 16.926 13.866 25.380 7.788 20.601 6.569 — — — — — — — — — — — —
Space 76.405 76.405 432.636 432.636 11.608 17.900 13.36 17.150 38.202 38.202 12.501 9.357 9.07 9.424
Table 6: Summary of space and normalized times for various operations on the Pentium III.
58
Rand uses a randomized ordering of the vertices and Ordr uses the separator ordering. The times for DPS, Read, and Find Next reported for List and LEDA are based on linear insertion of the edges (i.e., this is the best case for them). Dyn refers to a version of our dynamic data structure that does not cache the edges for vertices in adjacency lists. Cached refers to a version that does. For the "DynSpace" and "CachedSpace" structures we used a space-efficient block size; for "DynTime" and "CachedTime" we used a time-efficient one. Array refers to adjacency arrays. Byte, Nibble, Snip and Gamma refer to the corresponding static representations. Note that the cached version of our dynamic algorithm is generally slightly slower, but for the linear and transpose insertions it is much faster than the noncached version. Those insertions are the operations that can make use of cache locality. For linear insertion our cached dynamic representations is a factor of 3-4 times slower than adjacency lists on the Pentium 4 and a factor of about 1.5 slower on the Pentium III. LEDA is significantly slower and less space efficient than the other representations, but as previously mentioned LEDA has many features these other representations do not have. 7
Algorithms
Here we describe results for two algorithms that might have the need for potentially very large graphs: Google's PageRank algorithm and a maximum bipartite matching algorithm. They are meant to represent a somewhat more realistic application of graphs than a simple DFS. PageRank. We use the simplified version of the PageRank algorithm [22]. The algorithm involves finding the eigenvector of a sparse matrix (1 — e)A + eU, where A is the matrix representing the link structure among pages on the web (normalized), U is the uniform matrix (normalized) and e is a parameter of the algorithm. This eigenvector can be computed iteratively by maintaining a vector R and computing on each step Ri = ((1 — t)A + eU)Ri-i. Each step can be implemented by multiplication of a vector by a sparse 0-1 matrix representing the links in A, followed by adding a uniform vector and normalizing across the resulting vector to account for the out degrees (since A needs to be normalized). The standard representation of a sparse matrix is the adjacency array as previously described. We compare an adjacency-array implementation with several other implementations. We ran this algorithm on the Google out-link graph for 50 iterations with e = .15. For each representation we computed the time and space required. Figure 7 lists the results. On the Pentium III, our static representa-
Representation Dyn-B4 Dyn-N4 Dyn-B8 Dyn-N8 Gamma Snip Nibble Byte ArrayOrdr ArrayRand ListOrdr ListRand
Time PHI 30.40 32.96 26.55 30.29 38.56 34.19 26.38 21.09 21.12 33.83 30.96 44.56
(sec) P4 11.05 12.48 9.23 11.25 15.60 13.38 10.94 8.04 6.38 27.59 6.12 28.33
Space (b/e) 17.54 13.28 19.04 15.65 9.63 9.43 9.72 12.59 37.74 37.74 75.49 75.49
Table 7: Performance of our PageRank algorithm on different representations.
tion with the byte code is the best. On the Pentium 4, the array with ordered labeling gives the fastest results, while the byte code gives good compression without sacrificing too much speed. Bipartite Matching. The maximum bipartite matching algorithm is based on representing the graph as a network flow and using depth first search to find augmenting paths. It takes a bipartite graph from vertices on the left to vertices on the right and assigns a capacity of 1 to each edge. For each edge the implementation maintains a 0 or 1 to indicate the currentflowon the edge. It loops through the vertices in the left set using DFS to find an augmenting path for each vertex. If it finds one it pushes one unit of flow through and updates the edge weights appropriately. Even though conceptually the graph is directed, the implementation needs to maintain edges in both directions to implement the depth-first search. To avoid an fi(n2) best-case runtime, a stack was used to store the vertices visited by each DFS so that the entire bit array of visited vertices did not need to be cleared each time. This optimization is suggested in the LEDA book [19, page 372]. We also implemented an optimization that does one level of BFS before the DFS. This improved performance by 40%. Finally we used a strided loop through the left vertices, using a prime number (11) as the stride. This reduces locality, but greatly improved performance since the average since of the DFS to find an unmatched pair was reduced signficantly. Since the graph is static the static representations are sufficient. We ran this algorithm using our byte code, nibble code, and adjacency array implementations. The bit array for the 0/1 flow flags is accessed using the same indexing structure (semi-direct-16) as
59
Representat ion Nibble Byte ArrayOrdr ArrayRand
Time (sec) P4 75.8 27.6 59.9 19.9 57.1 18.6 83.2 28.0
PHI
Space (b/e) 13.477 16.363 41.678 41.678
Table 8: Performance of our bipartite maximum matching algorithm on different static representations.
used for accessing the adjacency lists. A dynamically sized stack is used for the DPS and for storing the visited vertices during a DPS. We store 1 bit for every edge (in each direction) to indicate the the current flow, 1 bit for every vertex to mark visited flags, and 1 bit for every vertex on the right to mark whether it is matched. The maximum bipartite matching algorithm was run on a modified version of the Google-out graph. Two copies were created for each vertex, one on the left and one on the right. The out links in the Google graph point from the left vertices to the right ones. The results are given in Figure 8. The memory listed is the total memory including the representation of the graph, the index for 0/1 flow flags, the flow flags themselves, the visited and matched flags and the stacks. For all three representations we assume the same layout for this auxiliary data, so the only difference in space is due to the graph representation. The space needed for the two stacks is small since the largest DPS involves under 10000 vertices.
8
Discussion
Our experiments indicate that the additional cost needed to decode the compressed representation is small or insignificant compared to other cost for even a rather simple graph algorithm, DPS. As noted, under most situations the compressed representations are faster than standard representations even though many more operations are needed for the decoding. This seems to be because the performance bottleneck is accessing memory and not the bit operations used for decoding. The one place where the standard representations are slightly faster for DPS is when using separator orderings, and linear insertion on the Pentium 4. We were somewhat surprised at the large effect that different orderings had on the performance on the Pentium 4 for both adjacency lists and adjacency arrays. The performance differed by up to a factor of 11, apparently purely based on caching effects (the number of edges traversed is identical for any DPS on a fixed graph). The differences indicate that performance numbers reported for graph algorithms should specify the layout of memory and ordering used for the vertices. The differences also indicate that significant attention needs to be paid to vertex ordering in implementing fast graph algorithms. We note that the same separator ordering as used for graph compression seems to work very well for improving performance on adjacency lists and adjacency arrays. This is not surprising since both compression and memory layout can take advantage of locality in the graphs so that most accesses are close in the ordering. In our analysis we do not consider applications that have a significant quantity of information that needs to be stored with the graphs, such as large weights on the edges or labels on vertices. Clearly such data might diminish the advantages of compressing the graph structure. We note, however, that such data might also be compressed. In fact the locality of the labeling that the separators give could be useful for such compression. For example on the web graphs, vertices nearby in the vertex ordering are likely to share a large prefix of their URL. Similarly on the finite-element meshes, vertices nearby in the vertex ordering are likely to be nearby in space, and hence might be difference encoded. The ideas used in this paper can clearly be generalized to other structures beyond simple graphs. A separate paper [4] describes how similar ideas can be used for representing simplicial meshes in two and three dimensions.
Here we summarize what we feel are the most important or surprising results of the experiments. First we note that the simple and fast separator heuristic we used seems to work very well for our purposes. This is likely because the compression is much less sensitive to the quality of the separator than other applications of separators, such as nested dissection [16]. For nested dissection more sophisticated separators are typically used. It would be interesting to study the theoretical properties of the simple heuristic. For our bounds rather sloppy approximations on the separators are sufficient since any separator of size kn°, c < I will give the required bounds, even if actual separators might be much smaller. We note that all the "real-world" graphs we were able to find had small separators—much smaller than References would be expected for random graphs. Small separators [1] C. J. Alpert. The ISPD circuit benchmark suite. In is a property of real world graphs that is sometimes not ACM International Symposium on Physical Design, properly noted. pages 80-85, Apr. 1998.
60
[2] C. J. Alpert and A. Kahng. Recent directions in netlist partitioning: A survey. VLSI Journal, 19(1-2):1-81, 1995.
[3] D. Blandford and G. Blelloch. Index compression through document reordering. In Data Compression Conference (DCC), pages 342-351, 2002.
[4] D. Blandford, G. Blelloch, D. Cardoze, and C. Kadow. Compact representations of simplicial meshes in two and three dimensions. In International Meshing Roundtable (IMR), pages 135-146, Sept. 2003.
[5] D. Blandford, G. Blelloch, and I. Kash. Compact representations of separable graphs. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, pages 342-351, 2003.
[6] Y.-T. Chiang, C.-C. Lin, and H.-I. Lu. Orderly spanning trees with applications to graph encoding and graph drawing. In SODA, pages 506-515, 2001.
[7] T. M. Chilimbi, M. D. Hill, and J. R. Larus. Cache-conscious structure layout. In Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, pages 1-12, 1999.
[8] R. C.-N. Chuang, A. Garg, X. He, M.-Y. Kao, and H.-I. Lu. Compact encodings of planar graphs via canonical orderings and multiple parentheses. Lecture Notes in Computer Science, 1443:118-129, 1998.
[9] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, IT-21(2):194-203, March 1975.
[10] Google. Google programming contest web data. http://www.google.com/programming-contest/, 2002.
[11] H. Han and C.-W. Tseng. A comparison of locality transformations for irregular codes. In Proc. Languages, Compilers, and Run-Time Systems for Scalable Computers, pages 70-84, 2000.
[12] X. He, M.-Y. Kao, and H.-I. Lu. Linear-time succinct encodings of planar graphs via canonical orderings. SIAM J. on Discrete Mathematics, 12(3):317-325, 1999.
[13] X. He, M.-Y. Kao, and H.-I. Lu. A fast general methodology for information-theoretically optimal encodings of graphs. SIAM J. Computing, 30(3):838-846, 2000.
[14] G. Jacobson. Space-efficient static trees and graphs. In 30th FOCS, pages 549-554, 1989.
[15] K. Keeler and J. Westbrook. Short encodings of planar graphs and maps. Discrete Applied Mathematics, 58:239-252, 1995.
[16] R. J. Lipton, D. J. Rose, and R. E. Tarjan. Generalized nested dissection. SIAM Journal on Numerical Analysis, 16:346-358, 1979.
[17] R. J. Lipton and R. E. Tarjan. A separator theorem for planar graphs. SIAM J. Applied Mathematics, 36:177-189, 1979.
[18] H.-I. Lu. Linear-time compression of bounded-genus graphs into information-theoretically optimal number of bits. In SODA, pages 223-224, 2002.
[19] K. Mehlhorn and S. Näher. LEDA: A Platform for Combinatorial and Geometric Computing. Cambridge University Press, 1999.
[20] G. L. Miller, S.-H. Teng, W. P. Thurston, and S. A. Vavasis. Separators for sphere-packings and nearest neighbor graphs. Journal of the ACM, 44:1-29, 1997.
[21] J. I. Munro and V. Raman. Succinct representation of balanced parentheses, static trees and planar graphs. In 38th FOCS, pages 118-126, 1997.
[22] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
[23] A. L. Rosenberg and L. S. Heath. Graph Separators, with Applications. Kluwer Academic/Plenum Publishers, 2001.
[24] J. Rossignac. Edgebreaker: Connectivity compression for triangle meshes. IEEE Transactions on Visualization and Computer Graphics, 5(1):47-61, 1999.
[25] SCAN project. Internet maps. http://www.isi.edu/scan/mercator/maps.html, 2000.
[26] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
[27] H. D. Simon. Partitioning of unstructured problems for parallel processing. Computing Systems in Engineering, 2:135-148, 1991.
[28] J. Sperling. Development and maintenance of the TIGER database: Experiences in spatial data sharing at the U.S. Bureau of the Census. In Sharing Geographic Information, pages 377-396, 1995.
[29] A. Strehl and J. Ghosh. A scalable approach to balanced, high-dimensional clustering of market-baskets. In Proc. of the Seventh International Conference on High Performance Computing (HiPC 2000), volume 1970 of Lecture Notes in Computer Science, pages 525-536. Springer, Dec. 2000.
[30] A. Szymczak and J. Rossignac. Grow & Fold: compressing the connectivity of tetrahedral meshes. Computer-Aided Design, 32:527-537, 2000.
[31] S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development, 41(6):711-726, 1997.
[32] G. Turán. Succinct representations of graphs. Discrete Applied Mathematics, 8:289-294, 1984.
[33] J. D. Ullman. Computational Aspects of VLSI. Computer Science Press, Rockville, MD, 1984.
[34] U.S. Census Bureau. UA Census 2000 TIGER/Line file download page. http://www.census.gov/geo/www/tiger/tigerua/ua_tgr2k.html, 2000.
[35] C. Walshaw. Graph partitioning archive. http://www.gre.ac.uk/~c.walshaw/partition/, 2002.
[36] D. Watts and S. Strogatz. Collective dynamics of small-world networks. Nature, 393:440-442, 1998.
Kernelization Algorithms for the Vertex Cover Problem: Theory and Experiments*

Faisal N. Abu-Khzam†, Rebecca L. Collins‡, Michael R. Fellows§, Michael A. Langston‡, W. Henry Suters‡ and Christopher T. Symons‡

Abstract
A variety of efficient kernelization strategies for the classic vertex cover problem are developed, implemented and compared experimentally. A new technique, termed crown reduction, is introduced and analyzed. Applications to computational biology are discussed.

*This research has been supported in part by the National Science Foundation under grants EIA-9972889 and CCR-0075792, by the Office of Naval Research under grant N00014-01-1-0608, by the Department of Energy under contract DE-AC05-00OR22725, and by the Tennessee Center for Information Technology Research under award E01-0178-081.
†Division of Computer Science and Mathematics, Lebanese American University, Chouran, Beirut 1102 2801, Lebanon
‡Department of Computer Science, University of Tennessee, Knoxville, TN 37996-3450, USA
§School of Electrical Engineering and Computer Science, University of Newcastle, Callaghan NSW 2308, Australia

1 Introduction
The computational challenge posed by NP-hard problems has inspired the development of a wide range of algorithmic techniques. Due to the seemingly intractable nature of these problems, practical approaches have historically concentrated on the design of polynomial-time algorithms that deliver only approximate solutions. The notion of fixed-parameter tractability (FPT) has recently emerged as an alternative to this trend. FPT's roots can be traced at least as far back as work motivated by the Graph Minor Theorem to prove that a variety of otherwise difficult problems are decidable in low-order polynomial time when relevant parameters are fixed. See, for example, [10, 11]. Formally, a problem is FPT if it has an algorithm that runs in O(f(k)n^c) time, where n is the problem size, k is the input parameter, and c is a constant [9].

A well-known example is the parameterized Vertex Cover problem. Vertex Cover is posed as an undirected graph G and a parameter k. The question asked is whether G contains a set C of k or fewer vertices such that every edge of G has at least one endpoint in C. Vertex Cover can be solved in O(1.2852^k + kn) time [5] with the use of a bounded search tree technique. This technique restricts a problem's search space to a tree whose size is bounded only by a function of the relevant parameter.

Vertex Cover has a host of real-world applications, particularly in the field of computational biology. It can be used in the construction of phylogenetic trees, in phenotype identification, and in analysis of microarray data, to name just a few. While the fact that the parameterized Vertex Cover problem is FPT makes the computation of exact solutions theoretically tractable, the practical matter of reducing run times to reasonable levels for large parameter values has remained a formidable challenge. In this paper, we develop and implement a suite of algorithms, each of which takes as input a graph G of size n and a parameter k, and returns a graph G' of size n' ≤ n and a parameter k' ≤ k. It is important that (1) n' is bounded by a function only of k' (not of n) and (2) G has a vertex cover of size at most k if and only if G' has a vertex cover of size at most k'. Each algorithm may be employed independently or in conjunction with others. The use of such techniques is called kernelization. An amenability to kernelization seems to be a hallmark of problems that are FPT, and a characteristic that distinguishes them from apparently more difficult NP-hard problems. After kernelization is completed, the solution process reverts to branching. Large-scale empirical studies of branching methods are also underway. See, for example, [2].

2 Kernelization Alternatives
Our vertex cover kernelization suite consists of four separate techniques. The first method is a simple scheme based on the elimination of high degree vertices. The second and third methods reformulate vertex cover as an integer programming problem, which is then simplified using linear programming. This linear programming problem can either be solved using standard linear programming techniques or restated as a network flow problem that can then be solved using an algorithm developed by Dinic [8, 12]. The fourth method, which is new, we call crown reduction. It is based on finding a particular independent set and its neighborhood, both
of which can be removed from the graph. We develop the theoretical justification for each of these techniques, and provide examples of their performance on samples of actual application problems.

3 Preprocessing Rules
The techniques we employ are aided by a variety of preprocessing rules. These are computationally inexpensive, requiring at most O(n^2) time with very modest constants of proportionality.

Rule 1: An isolated vertex (one of degree zero) cannot be in a vertex cover of optimal size. Because there are no edges incident upon such a vertex, there is no benefit in including it in any cover. Thus, in G', an isolated vertex can be eliminated, reducing n' by one. This rule is applied repeatedly until all isolated vertices are eliminated.

Rule 2: In the case of a pendant vertex (one of degree one), there is an optimal vertex cover that does not contain the pendant vertex but does contain its unique neighbor. Thus, in G', both the pendant vertex and its neighbor can be eliminated. This also eliminates any additional edges incident on the neighbor, which may leave isolated vertices for deletion under Rule 1. This reduces n' by the number of deleted vertices and reduces k' by one. This rule is applied repeatedly until all pendant vertices are eliminated.

Rule 3: If there is a degree-two vertex with adjacent neighbors, then there is a vertex cover of optimal size that includes both of these neighbors. If u is a vertex of degree 2 and v and w are its adjacent neighbors, then at least two of the three vertices (u, v, and w) must be in any vertex cover. Choosing u to be one of these vertices would only cover edges (u, v) and (u, w), while eliminating u and including v and w could possibly cover not only these but additional edges. Thus there is a vertex cover of optimal size that includes v and w but not u. G' is created by deleting u, v, w and their incident edges from G. It is then also possible to delete the neighbors of v and w whose degrees drop to zero. This reduces n' by the number of deleted vertices and reduces k' by two. This rule is applied repeatedly until all degree-two vertices with adjacent neighbors are eliminated.

Rule 4: If there is a degree-two vertex, u, whose neighbors, v and w, are non-adjacent, then u can be folded by contracting edges {u, v} and {u, w}. This is done by replacing u, v and w with one vertex, u', whose neighborhood is the union of the neighborhoods of v and w in G. This reduces the problem size by two and the parameter size by one. This idea was first proposed in [5], and warrants explanation. To illustrate, suppose u is a vertex of degree 2 with neighbors v and w. If one neighbor of u is included in the cover and is eliminated, then u becomes a pendant vertex and can also be eliminated by including its other neighbor in the cover. Thus it is safe to assume that there are two cases: first, u is in the cover while v and w are not; second, v and w are in the cover while u is not. If u' is not included in an optimal vertex cover of G', then all the edges incident on u' must be covered by other vertices. Therefore v and w need not be included in an optimal vertex cover of G, because the remaining edges {u, v} and {u, w} can be covered by u. In this case, if the size of the cover of G' is k', then the cover of G will have size k = k' + 1, so the decrement of k in the construction is justified. On the other hand, if u' is included in an optimal vertex cover of G', then at least some of its incident edges must be covered by u'. Thus the optimal cover of G must also cover its corresponding edges by either v or w. This implies that both v and w are in the vertex cover. In this case, if the size of the cover of G' is k', then the cover of G will also be of size k = k' + 1. This rule is applied repeatedly until all vertices of degree two are eliminated. If recovery of the computed vertex cover is required, a record must be kept of this folding so that once the cover of G' has been computed, the appropriate vertices can be included in the cover of G.

4 Kernelization by High Degree
This simple technique [3] relies on the observation that a vertex whose degree exceeds k must be in every vertex cover of size at most k. (If the degree of v exceeds k but v is not included in the cover, then all of v's neighbors must be in the cover, making the size of the cover at least k + 1.) This algorithm is applied repeatedly until all vertices of degree greater than k are eliminated. It is superlinear (O(n^2)) only because of the need to compute the degree of each vertex.

The following theorem is a special case of a more general result from [1]. It is used to bound the size of the kernel that results from the application of this algorithm in combination with the aforementioned preprocessing rules. Note that if this algorithm and the preprocessing rules are applied, then the degree of each remaining vertex lies in the range [3, k'].

THEOREM 4.1. If G' is a graph with a vertex cover of size k', and if no vertex of G' has degree less than three or more than k', then n' ≤ k'^2/3 + k'.

Proof. Let C be a vertex cover of G', with |C| = k'. C's complement, C̄, is an independent set of size n' − k'. Let F be the set of edges in G' with endpoints in C̄. Since the elements of C̄ have degree at least three, each element of C̄ must have at least three neighbors in C. Thus the number of edges in F must be at least 3(n' − k'). The number of edges with an endpoint in C is no smaller than |F| and no larger than k'|C|, since each element of C has at most k' neighbors. Therefore 3(n' − k') ≤ |F| ≤ k'|C| = k'^2, and the bound n' ≤ k'^2/3 + k' follows.
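The degree-based reductions above are simple to express in code. The sketch below applies Rules 1 and 2 together with the high-degree rule of this section on a dictionary-of-sets adjacency structure; it is only an illustration under our own naming (the folding Rules 3 and 4 are omitted for brevity), not the implementation used for the experiments.

```python
def kernelize_simple(adj, k):
    """Rules 1-2 plus the high-degree rule, applied until none fires.
    adj: dict mapping each vertex to a set of neighbours (copied, not modified).
    Returns (reduced adjacency, reduced parameter, vertices forced into the cover)."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    forced = set()

    def remove(v):
        for w in adj[v]:
            adj[w].discard(v)
        del adj[v]

    changed = True
    while changed and k >= 0:
        changed = False
        for v in list(adj):
            if v not in adj:              # already removed earlier in this sweep
                continue
            deg = len(adj[v])
            if deg == 0:                  # Rule 1: isolated vertex is never needed
                remove(v)
                changed = True
            elif deg == 1:                # Rule 2: take the pendant's unique neighbour
                u = next(iter(adj[v]))
                forced.add(u)
                remove(u)
                remove(v)
                k -= 1
                changed = True
            elif deg > k:                 # Section 4: a high-degree vertex must be in the cover
                forced.add(v)
                remove(v)
                k -= 1
                changed = True
    return adj, k, forced
```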
5 Kernelization by Linear Programming
Vertex cover can be stated as an optimization problem: assign to each vertex u a value X_u ∈ {0, 1}, (1) minimize Σ_{u∈V} X_u, (2) subject to X_u + X_v ≥ 1 whenever {u, v} ∈ E. This is an integer programming formulation of the optimization problem. In this context the objective function is the size of the vertex cover, and the set of all feasible solutions consists of functions from V to {0, 1} that satisfy condition (2). We relax the integer programming problem to a linear programming problem by replacing the restriction X_u ∈ {0, 1} with X_u ≥ 0. The value of the objective function returned by the linear programming problem is a lower bound on the objective function returned by the related integer programming problem [12, 13, 14].

The solution to the linear programming problem can be used to simplify the related integer programming problem in the following manner. Let N(S) denote the neighborhood of S, and define P = {u ∈ V | X_u > 0.5}, Q = {u ∈ V | X_u = 0.5} and R = {u ∈ V | X_u < 0.5}. We employ the following modification by Khuller [13] of a theorem originally due to Nemhauser and Trotter [14].

THEOREM 5.1. If P, Q, and R are defined as above, there is an optimal vertex cover that is a superset of P and that is disjoint from R.

Proof. Let A be the set of vertices of P that are not in the optimal vertex cover and let B be the set of vertices of R that are in the optimal cover, as selected by the solution to the integer programming problem. Notice that N(R) ⊆ P because of condition (2). It is not possible for |A| < |B|, since in this case replacing B with A in the cover decreases its size without uncovering any edges (since N(R) ⊆ P), and so it is not optimal. Additionally, it is not possible for |A| > |B|, because then we could gain a better linear programming solution by setting ε = min{X_v − 0.5 : v ∈ A} and replacing X_u with X_u + ε for all u ∈ B and replacing X_v with X_v − ε for all v ∈ A. Thus we must conclude that |A| = |B|, and
in this case we can replace B with A in the vertex cover (again since N(R) ⊆ P) to obtain the desired optimal cover.

The graph G' is produced by removing vertices in P and R and their adjacent edges. The problem size is n' = n − |P| − |R| and the parameter size is k' = k − |P|. Notice that since the size of the objective function for the linear programming problem provides a lower bound on the objective function for the integer programming problem, the size of any optimal cover of G' is bounded below by Σ_{u∈Q} X_u = 0.5|Q|. If this were not the case, then the original linear programming procedure that produced Q would not have produced an optimal result. This allows us to observe that if |Q| > 2k', then this is a "no" instance of the vertex cover problem.

When dealing with large dense graphs the above linear programming procedure may not be practical, since the number of constraints is the number of edges in the graph. Because of this, the code used in this paper solves the dual of the LP problem, turning the minimization problem into a maximization problem and making the number of constraints equal to the number of vertices [6, 7]. Other methods to speed LP kernelization appear in [12].

6 Kernelization by Network Flow
This algorithm solves the linear programming formulation of vertex cover by reducing it to a network flow problem. As in [14], we define a bipartite graph B in terms of the input graph G, find the vertex cover of B by computing a maximum matching on it, and then assign values to the vertices of G based on the cover of B. In our implementation of this algorithm, we compute the maximum matching on B by turning it into a network flow problem and using Dinic's maximum flow algorithm [8, 12]. The time complexity of the overall procedure is O(m√n), where m denotes the number of edges and n denotes the number of vertices in G. The size of the reduced problem kernel is bounded by 2k. The difference between this method and the previous LP kernelization is that this method is faster (LP takes O(n^3)) and is guaranteed to assign values in {0, 0.5, 1}, while LP codes assign values in the (closed) interval between 0 and 1.

Given a graph G, the following algorithm can be used to produce an LP kernelization of G.

Step 1: Convert G = (V, E) to a bipartite graph H = (U, F). U = A ∪ B, where A = {A_v | v ∈ V} and B = {B_v | v ∈ V}. If (v, w) ∈ E, then we place both (A_v, B_w) and (A_w, B_v) in F.

Step 2: Convert the bipartite graph H to a network
flow graph H': Add a source node that has directed arcs toward every vertex in A, and add a sink node that receives directed arcs from every vertex in B. Make all edges between A and B directed arcs toward B. Give all arcs a capacity of 1.

Step 3: Find an instance of maximum flow through the graph H'. For this project we used Dinic's algorithm, but any maximum flow algorithm will work.

Step 4: The arcs in H' included in the instance of maximum flow that correspond to edges in the bipartite graph H constitute a maximum matching set, M, of H.

Step 5: From M we can find an optimal vertex cover of H. Case 1: If all vertices are included in the matching, the vertex cover of H is either the set A or the set B. Case 2: If not all vertices are included in the matching, we begin by constructing three sets S, R, and T. With the setup we have here (|A| = |B| and all capacities are 1), if all vertices in A are matched, then all vertices in B are too. So we can assume that there is at least one unmatched vertex in A. Let S denote the set of all unmatched vertices in A. Let R denote the set of all vertices in A that are reachable from S by alternating paths with respect to M. Let T denote the set of neighbors of R along edges in M. The vertex cover of the bipartite graph H is (A − S − R) ∪ T. The size of the cover is |M|.

Step 6: Assign weights to all of the vertices of G according to the vertex cover of H. For vertex v: W_v = 1 if A_v and B_v are both in the cover of H; W_v = 0.5 if only one of A_v or B_v is in the cover of H; W_v = 0 if neither A_v nor B_v is in the cover of H. In Case 1 of Step 5, where one of the sets A or B becomes the vertex cover, all vertices are returned with the weight 0.5.

Step 7: The graph that remains will be G' = (V', E') where V' = {v | W_v = 0.5} and k' = k − x, where x is the number of vertices with weight W_v = 1.

THEOREM 6.1. Step 5 of this algorithm produces a valid optimal vertex cover of H.

Proof. In Case 1, the vertex cover of H includes all of A or all of B. The size of the vertex cover is |A| = |B| = |M|. Without loss of generality assume the vertex cover is A. All edges in the bipartite graph have exactly one endpoint in A. Thus, every edge is covered, and the vertex cover is valid.

In Case 2, we have sets S, R ⊆ A and T ⊆ B. The vertex cover is defined as (A − S − R) ∪ T. For every edge (x, y) ∈ H, x must lie in S, R or A − S − R. Suppose x lies in S, in which case x is unmatched. Then y must be matched, because otherwise M would not be optimal. Hence another edge (w, y) exists in M. Then w ∈ R and y ∈ T, and therefore (x, y) is covered. We now argue that N(R) = T. By definition, T ⊆ N(R). To see that N(R) ⊆ T, note that if y is contained in N(R) and not in T, then y must be matched, because otherwise there would be an augmenting path (some neighbor of y contained in R is reachable from S by an alternating path). Thus, the neighbor of y contained in M must also be in R, and so y must lie in T. Thus, if x lies in R, y must lie in T and again edge (x, y) is covered. Finally, if x lies in A − S − R, (x, y) is covered by definition.

As for the size of the cover, |S| = n − |M|, where n is the number of vertices in the original graph. Because all elements of R are matched, and by definition T is all vertices reachable from R by matched edges, |T| = |R|. Therefore the size of the cover is |(A − S − R) ∪ T| = |(A − S − R) ∪ R| = |A − S| = |M|. Since H is a bipartite graph with a maximum matching of size |M|, the minimum vertex cover size for H is |M|, so this cover is optimal.

THEOREM 6.2. Step 6 of this algorithm produces a feasible solution to the linear programming formulation on G.

Proof. Each vertex in G is assigned a weight, either 0, 0.5, or 1. For every edge (x, y) ∈ G, we want the sum of the weights, W_x + W_y, to be greater than or equal to 1. This is the case whenever the cover of H contains either A_x and B_x, A_y and B_y, A_x and B_y, or A_y and B_x. If (x, y) ∈ G, then (A_x, B_y), (A_y, B_x) ∈ H. Because H's cover contains one or both endpoints of every edge, we know that at least one of these combinations is in the cover. Therefore every edge (x, y) ∈ G has a valid weight.

The graph G' is produced in the same manner as in the LP kernelization procedure, and again we have the situation where a "no" instance occurs whenever |G'| > 2k. The time complexity of the algorithm is O(m√n), where m and n are the number of edges and vertices, respectively, in the graph G. Since there are at most O(n^2) edges in the graph, this implies the overall method is O(n^2.5).

7 Kernelization by Crown Reduction
The technique we dub "crown reduction" is somewhat similar to the other algorithms just described. With it, we attempt to exploit the structure of the graph to identify two disjoint vertex sets H and I so that there is an optimal vertex cover containing every vertex of H
but no vertex of /. This process is based on the following definition, theorem, and algorithm. A crown is an ordered pair (H, I) of disjoint vertex subsets from a graph G that satisfies the following criteria: (1) H = N ( I ) , (2) / is a nonempty independent set, and (3) the edges connecting H and / contain a matching in which all elements of H are matched. H is said to contain the head of the crown, whose width is \H\. I contains the points of the crown. This notion is depicted in Figure 1. THEOREM 7.1. If G is a graph with a crown (77,7), then there is an optimal vertex cover of G that contains all of H and none of I.
Proof. Since there is a matching of the edges between H and I, any vertex cover must contain at least one vertex from each matched edge. Thus the matching will require at least |H| vertices in the vertex cover. This minimum number can be realized by selecting H to be in the vertex cover. It is further noted that vertices from H can be used to cover edges that do not connect I and H, while this is not true for vertices in I. Thus, including the vertices from H does not increase, and may decrease, the size of the vertex cover as compared to including vertices from I. Therefore, there is a minimum-size vertex cover that contains all the vertices in H and none of the vertices in I.

Figure 1: Sample crown decompositions.

The following algorithm can be used to find a crown in an arbitrary input graph.

Step 1: Find a maximal matching M1 of the graph, and identify the set of all unmatched vertices as the set O of outsiders.

Step 2: Find a maximum auxiliary matching M2 of the edges between O and N(O).

Step 3: Let I0 be the set of vertices in O that are unmatched by M2.

Step 4: Repeat steps 4a and 4b until n = N such that I_{N-1} = I_N.
Step 4a: Let H_n = N(I_n).
Step 4b: Let I_{n+1} = I_n ∪ {H_n's neighbors under M2}.

The desired crown is the ordered pair (H, I), where H = H_N and I = I_N. We now determine the conditions necessary to guarantee that this algorithm is successful in finding such a crown.

THEOREM 7.2. The algorithm produces a crown as long as the set I0 of unmatched outsiders is not empty.

Proof. First, since M1 is a maximal matching, the set O, and consequently its subset I, are both independent. Second, because of the definition of H, it is clear that H = N(I_{N-1}), and since I = I_N = I_{N-1} we know that H = N(I). The third condition for a crown is proven by contradiction. Suppose there were an element h ∈ H that were unmatched by M2. Then the construction of H would produce an augmenting (alternating) path of odd length. For h to be in H there must have been an unmatched vertex in O that begins the path. Then the repeated step 4a would always produce an edge that is not in the matching, while the next step 4b would produce an edge that is part of the matching. This process repeats until the vertex h is reached. The resulting path begins and ends with unmatched vertices and alternates between matched and unmatched edges. Such a path cannot exist if M2 is in fact a maximum matching, because we could increase the size of the matching by swapping the matched and unmatched edges along the path. Therefore every element of H must be matched by M2. The actual matching used in the crown is the matching M2 restricted to edges between H and I.

The graph G' is produced by removing vertices in H and I and their adjacent edges. The problem size is n' = n − |H| − |I|; the parameter size is k' = k − |H|. It is important to note that if a maximum matching of size greater than k is found, then there can be no vertex cover of size at most k, making this a "no" problem instance. Therefore, if either of the matchings M1 and M2 is larger than k, the process can be halted. This fact also allows us to place an upper bound on the size of the graph G'.

THEOREM 7.3. If M1 and M2 are each of size at most k, then there are no more than 3k vertices that lie outside the crown.

Proof. Because the size of M1 is at most k, it contains at most 2k vertices. Thus, the set O contains at least n − 2k vertices. Because the size of M2 is at most k, no more than k vertices in O are matched by M2. Thus, there are at least n − 3k vertices in O that are unmatched by M2. These vertices are included in I0 and are therefore in I. It follows that the number of vertices in G not included in H and I is at most 3k.

The particular crown produced by this decomposition depends on the maximal matching M1 used in its calculation. This suggests that it may be desirable to try to perform the decomposition repeatedly, using pseudo-randomly chosen matchings, in an attempt to identify as many crowns as possible and consequently to reduce the size of the kernel as much as possible. It may also be desirable to perform preprocessing after each decomposition, because the decomposition itself can leave vertices of low degree. The most computationally expensive part of the procedure is finding the maximum matching M2, which we accomplish in our implementations by recasting the maximum matching problem on a bipartite graph as a network flow problem. This we then solve using Dinic's algorithm. The run time is O(m√n), which is often considerably better than O(n^2.5).

8 Applications and Experimental Results
Our experiments were run in the context of computational biology, where a common problem involves finding maximum cliques in graphs. Clique is W[1]-hard, however, and thus unlikely to be directly amenable to a fixed-parameter tractable approach [9]. Of course, a graph has a vertex cover of size k if and only if its complement has a clique of size n − k. We therefore exploit this duality, finding maximum cliques via minimum covers.

One of the applications to which we have applied our methods involves finding phylogenetic trees based on protein domain information, a high-throughput technique pioneered in [4]. The graphs we utilized were obtained from domain data gleaned at NCBI and SWISSPROT, two well-known open-source repositories of biological information. Tables 1 through 3 illustrate representative results on graphs derived from the sh2 protein domain. The integer after the domain name indicates the threshold used to convert the input into an unweighted graph.
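Before turning to the results, here is a small sketch of the crown construction of Section 7 (Steps 1-4) on a networkx graph. The maximum matching M2 is delegated to networkx's Hopcroft-Karp routine rather than to the flow-based code described above, and all names are ours, so this is an illustration rather than the authors' implementation.

```python
import networkx as nx
from networkx.algorithms import bipartite

def find_crown(G):
    """Steps 1-4 of the crown-finding algorithm on an undirected networkx graph G.
    Returns a crown (H, I) or None if I0 is empty for the chosen M1."""
    # Step 1: a maximal matching M1; the unmatched vertices form the outsider set O
    M1 = nx.maximal_matching(G)
    O = set(G) - {v for e in M1 for v in e}          # independent, since M1 is maximal
    # Step 2: a maximum matching M2 between O and N(O)
    B = nx.Graph((u, w) for u in O for w in G[u])
    top = [v for v in O if v in B]
    M2 = bipartite.hopcroft_karp_matching(B, top_nodes=top) if len(B) else {}
    # Step 3: I0 = outsiders left unmatched by M2
    I = {v for v in O if v not in M2}
    if not I:
        return None
    # Step 4: alternate H_n = N(I_n) and I_{n+1} = I_n plus the M2-partners of H_n
    while True:
        H = set().union(*(set(G[v]) for v in I))
        I_next = I | {M2[h] for h in H if h in M2 and M2[h] in O}
        if I_next == I:
            return H, I
        I = I_next
```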
In our implementations, the high-degree method is incorporated along with the preprocessing rules. In general, we have found that the most efficient approach is to run this combination before attempting any of the other kernelization methods. To see this, compare the results of Table 1 with those of Table 2. Next, it is often beneficial to use one or more other kernelization routines. As long as the problem is not too large, network flow and linear programming are sometimes able to solve the problem without any branching whatsoever. This behavior is exemplified in Table 2. The final task is to perform branching if needed. On very dense graphs, kernelization techniques (other than the high-degree rule) may not reduce the graph very much, if at all. Both linear programming and network flow can be computationally expensive. Because crown reduction is quick by comparison, performing it prior to branching appears to be a wise choice. This aspect of kernelization is highlighted in Table 4. Unlike the others, the graph used in this experiment was derived from microarray data, where a maximum clique corresponds to a set of putatively co-regulated genes. 9 A Few Conclusions Crown reduction tends to run much faster in practice than does linear programming. It sometimes reduces the graph just as well, even though its worst-case bound on kernel size is larger. Given the methods at hand, the most effective approach seems to be first to run preprocessing and the high-degree algorithm, followed by crown reduction. If the remaining problem kernel is fairly sparse, then either linear programming or network flow should probably be applied before proceeding on to the branching stage. On the other hand, if the
Algorithm                       | run time | kernel size (n') | parameter size (k')
High Degree with Preprocessing  | 0.58     | 181              | 43
Linear Programming              | 1.15     | 0                | 0
Network Flow                    | 1.25     | 36               | 18
Crown Reduction                 | 0.23     | 328              | 98

Table 1: Graph: sh2-3.dim, n = 839, k = 246. Times are given in seconds.
Algorithm           | run time | kernel size (n') | parameter size (k')
Linear Programming  | 0.05     | 0                | 0
Network Flow        | 0.02     | 0                | 0
Crown Reduction     | 0.03     | 69               | 23

Table 2: Graph: sh2-3.dim, n = 839, k = 246. Preprocessing (including the high-degree algorithm) was performed before each of the other 3 methods. Times are given in seconds.
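As a point of reference for these results, the LP relaxation of Section 5 and the resulting P/Q/R split can be prototyped in a few lines with an off-the-shelf LP solver. The sketch below uses scipy's linprog on the primal (not the dual actually solved for the experiments); the vertex and edge containers and the tolerance are our assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def lp_kernel(vertices, edges, k, tol=1e-6):
    """Solve the LP relaxation of vertex cover and split V into P, Q, R (Theorem 5.1)."""
    idx = {v: i for i, v in enumerate(vertices)}
    n, m = len(vertices), len(edges)
    # minimize sum(x_v) subject to x_u + x_v >= 1 for every edge, 0 <= x_v <= 1
    A_ub = np.zeros((m, n))
    for r, (u, v) in enumerate(edges):
        A_ub[r, idx[u]] = A_ub[r, idx[v]] = -1.0
    res = linprog(c=np.ones(n), A_ub=A_ub, b_ub=-np.ones(m), bounds=[(0, 1)] * n)
    x = res.x
    P = {v for v in vertices if x[idx[v]] > 0.5 + tol}   # forced into the cover
    R = {v for v in vertices if x[idx[v]] < 0.5 - tol}   # excluded from the cover
    Q = set(vertices) - P - R                            # the kernel G[Q]
    return P, Q, R, k - len(P)
```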
Algorithm           | run time | kernel size (n') | parameter size (k')
Linear Programming  | 1:09.49  | 616              | 389
Network Flow        | 40.53    | 622              | 392
Crown Reduction     | 0.07     | 630              | 392

Table 3: Graph: sh2-10.dim, n = 726, k = 435. Preprocessing (including the high-degree algorithm) was performed before each of the other 3 methods. Times are given in seconds.
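The network-flow kernelization of Section 6 reduces to a maximum matching plus a minimum vertex cover on the doubled bipartite graph H (König's theorem). The sketch below leans on networkx's bipartite matching and vertex-cover helpers instead of Dinic's algorithm; apart from those library calls, all naming is ours and this is only an illustration of Steps 1-7.

```python
import networkx as nx
from networkx.algorithms import bipartite

def flow_kernel(vertices, edges, k):
    """Nemhauser-Trotter style kernel via bipartite matching (Section 6)."""
    H = nx.Graph()
    left = [("A", v) for v in vertices]
    H.add_nodes_from(left, bipartite=0)
    H.add_nodes_from((("B", v) for v in vertices), bipartite=1)
    for u, v in edges:                                   # Step 1: (A_u,B_v) and (A_v,B_u)
        H.add_edge(("A", u), ("B", v))
        H.add_edge(("A", v), ("B", u))
    matching = bipartite.hopcroft_karp_matching(H, top_nodes=left)   # Steps 2-4
    cover = bipartite.to_vertex_cover(H, matching, top_nodes=left)   # Step 5
    # Step 6: weights 0, 0.5, 1 according to how many copies of v lie in the cover
    weight = {v: 0.5 * ((("A", v) in cover) + (("B", v) in cover)) for v in vertices}
    P = {v for v in vertices if weight[v] == 1.0}        # forced into the cover
    kernel = {v for v in vertices if weight[v] == 0.5}   # Step 7: the remaining graph
    return kernel, k - len(P)
```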
Algorithm                       | run time | kernel size (n') | parameter size (k')
High Degree with Preprocessing  | 6.95     | 971              | 896
Linear Programming              | 37:58.95 | 1683             | 1608
Network Flow                    | 38:21.93 | 1683             | 1608
Crown Reduction                 | 6.11     | 1683             | 1608

Table 4: Graph: u74-0.7-75.compl, n = 1683, k = 1608, |E| = 1,259,512. Times are given in seconds.
kernel is relatively dense, it is probably best to avoid the cost of these methods, and instead begin branching straightaway.
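Read as code, the recommended pipeline of this section might look roughly as follows. Here preprocess_and_high_degree and lp_or_flow_kernel are hypothetical placeholders for the routines of Sections 3-6, find_crown is the earlier sketch, and the density threshold is an arbitrary stand-in for "fairly sparse"; this is a sketch of the ordering, not the authors' driver.

```python
def kernelize(G, k):
    """Preprocessing + high degree first, then repeated crown reduction,
    then LP/network flow only if the remaining kernel is sparse."""
    G = G.copy()
    G, k, forced = preprocess_and_high_degree(G, k)          # hypothetical (Secs. 3-4)
    while True:
        crown = find_crown(G)                                # sketch above (Sec. 7)
        if not crown or not crown[0]:
            break
        H, I = crown
        forced |= H
        k -= len(H)
        G.remove_nodes_from(H | I)
        G, k, extra = preprocess_and_high_degree(G, k)       # crowns leave low-degree vertices
        forced |= extra
    n, m = G.number_of_nodes(), G.number_of_edges()
    if n > 1 and m < 0.25 * n * (n - 1) / 2:                 # made-up sparsity threshold
        G, k, more = lp_or_flow_kernel(G, k)                 # hypothetical (Secs. 5-6)
        forced |= more
    return G, k, forced                                      # then branch on what is left
```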
Acknowledgment We wish to thank an anonymous reader, whose thorough review of our original typescript helped us to improve the presentation of the results we report here.
References
[1] F. N. Abu-Khzam. Topics in Graph Algorithms: Structural Results and Algorithmic Techniques, with Applications. PhD thesis, Dept. of Computer Science, University of Tennessee, 2003.
[2] F. N. Abu-Khzam, M. A. Langston, and P. Shanbhag. Scalable parallel algorithms for difficult combinatorial problems: A case study in optimization. In Proceedings, International Conference on Parallel and Distributed Computing and Systems, pages 563-568, Los Angeles, CA, November 2003.
[3] J. F. Buss and J. Goldsmith. Nondeterminism within P. SIAM Journal on Computing, 22:560-572, 1993.
[4] J. Cheetham, F. Dehne, A. Rau-Chaplin, U. Stege, and P. J. Taillon. Solving large FPT problems on coarse grained parallel machines. Technical report, Department of Computer Science, Carleton University, Ottawa, Canada, 2002.
[5] J. Chen, I. Kanj, and W. Jia. Vertex cover: further observations and further improvements. Journal of Algorithms, 41:280-301, 2001.
[6] V. Chvátal. Linear Programming. W. H. Freeman, New York, 1983.
[7] W. Cook. Private communication, 2003.
[8] E. A. Dinic. Algorithm for solution of a problem of maximum flows in networks with power estimation. Soviet Math. Dokl., 11:1277-1280, 1970.
[9] R. G. Downey and M. R. Fellows. Parameterized Complexity. Springer-Verlag, 1999.
[10] M. R. Fellows and M. A. Langston. Nonconstructive tools for proving polynomial-time decidability. Journal of the ACM, 35:727-739, 1988.
[11] M. R. Fellows and M. A. Langston. On search, decision and the efficiency of polynomial-time algorithms. Journal of Computer and Systems Sciences, 49:769-779, 1994.
[12] D. Hochbaum. Approximation Algorithms for NP-hard Problems. PWS, 1997.
[13] S. Khuller. The vertex cover problem. ACM SIGACT News, 33:31-33, June 2002.
[14] G. L. Nemhauser and L. E. Trotter. Vertex packings: Structural properties and algorithms. Mathematical Programming, 8:232-248, 1975.
Safe Separators for Treewidth*

Hans L. Bodlaender†    Arie M.C.A. Koster‡

Abstract
A set of vertices S ⊆ V is called a safe separator for treewidth, if S is a separator of G, and the treewidth of G equals the maximum, over all connected components W of G − S, of the treewidth of the graph obtained by making S a clique in the subgraph of G induced by W ∪ S. We show that such safe separators are a very powerful tool for preprocessing graphs when we want to compute their treewidth. We give several sufficient conditions for separators to be safe, allowing such separators, if existing, to be found in polynomial time. In particular, every minimal separator of size one or two is safe, every minimal separator of size three that does not split off a component with only one vertex is safe, and every minimal separator that is an almost clique is safe; an almost clique is a set of vertices W such that there is a v ∈ W with W − {v} a clique. We report on experiments that show significant reductions of instance sizes for graphs from probabilistic networks and frequency assignment.
*This research was partially supported by NWO-EW and partially by EC contract IST-1999-14186: Project ALCOM-FT (Algorithms and Complexity - Future Technologies).
†Institute of Information and Computing Sciences, Utrecht University, P.O. Box 80.089, 3508 TB Utrecht, the Netherlands. [email protected]
‡Konrad-Zuse-Zentrum für Informationstechnik Berlin, Takustraße 7, D-14195 Berlin, Germany. [email protected]

1 Introduction
Various NP-hard graph problems can be solved in polynomial time if the treewidth of the graph is bounded by a constant, see, amongst many others, [3, 7, 9, 11]. Experiments and applications show that this is also useful in a practical setting. The algorithm of Lauritzen and Spiegelhalter [16] to solve the probabilistic inference problem on probabilistic networks is the most commonly used algorithm for this problem and uses tree decompositions. Koster et al. [15] used tree decompositions to solve frequency assignment problems that could not be solved by other methods. An important problem that arises in such applications is to find a tree decomposition of the given graph of minimum or close to minimum width.

It is known that for each fixed k, there exists a linear time algorithm that checks if a given graph has
treewidth at most k, and if so, finds a tree decomposition of G of width at most k [6]. Unfortunately, it appears that the constant factor of this algorithm is too big to make this algorithm usable in practice. (See [20].) So, an important task is to design practically efficient methods for finding tree decompositions of small width.

In [8], preprocessing methods based on graph reduction were studied. A number of 'safe reduction rules' was proposed; each such rule rewrites the graph locally, thus decreasing the number of vertices in the graph, such that a tree decomposition of optimal width for the smaller reduced graph can be easily transformed to one for the original graph. When no reductions are possible, another method must be used to solve the problem on the remaining graph. Experiments on a set of graphs, taken from probabilistic networks applications, showed that the sizes of these remaining graphs were in general much smaller than the sizes of the original graphs. In some cases, reduction was sufficient for finding the optimal solution to the problem.

In this paper, we study a different form of preprocessing. Here we propose to use separators. Each preprocessing step with safe separators takes a graph, and replaces it by two or more smaller graphs. This way, we obtain a collection of graphs. Solving the treewidth problem on the original instance is equivalent to solving the treewidth problem on each of the graphs in the collection. However, the graphs in the collection are usually significantly smaller than the original instance. We can repeat trying to find safe separators in the graphs in the collection, replacing these again by even smaller graphs, until we do not find a safe separator in any graph in the collection. Then, the treewidth of the graphs in the collection must be established by other means: this may be trivial (e.g., when the graph is complete), can be done by an exact method like branch and bound (which can be fast enough when the preprocessing yielded only small graphs in the collection), or with an approximation algorithm.

After some preliminary definitions and results in Section 2, we establish our main graph theoretic results in Section 3. In this section, we give several sufficient conditions for a separator to be safe for treewidth. In Section 4, we discuss how the safe separators can be found. In Section 5, we discuss the experiments that
we have carried out. Some final conclusions are made in Section 6. Some proofs have been omitted in this extended abstract.

2 Definitions and preliminary results
In this section, we give a number of definitions and a few easy or well known lemmas. We assume the reader to be familiar with standard graph terminology. In this paper, we assume graphs to be undirected and without parallel edges or self loops. For a graph G = (V, E), let n = |V| be the number of vertices and m = |E| be the number of edges. For a vertex set S ⊆ V, we denote G − S as the subgraph of G induced by V − S, G[V − S]. We denote G + clique(S) as the graph (V, E ∪ {{v, w} | v, w ∈ S}).

A tree decomposition of G = (V, E) is a pair ({X_i | i ∈ I}, T), where {X_i | i ∈ I} is a collection of subsets of V and T = (I, F) is a tree, such that (i) ∪_{i∈I} X_i = V, (ii) for all {u, w} ∈ E, there exists an i ∈ I such that u, w ∈ X_i, and (iii) for all i, j, k ∈ I with j on the path in T from i to k, X_i ∩ X_k ⊆ X_j. The width of a tree decomposition ({X_i | i ∈ I}, T) is max_{i∈I} |X_i| − 1. The treewidth of G is the minimum width over all tree decompositions of G. It is well-known for treewidth that for a set W ⊆ V that induces a clique in graph G = (V, E), and a tree decomposition ({X_i | i ∈ I}, T = (I, F)) of G, there exists an i ∈ I with W ⊆ X_i.

A set of vertices S ⊆ V is a separator in G = (V, E) when G − S has more than one connected component. S is a minimal separator when it does not contain another separator as a proper subset. It is well-known that S is a minimal separator if and only if for every component Z of G − S, and for every vertex v ∈ S, there is a vertex w ∈ Z that is adjacent to v. S is a minimum separator when G has no separator of size smaller than S. S is a clique separator when S forms a clique in G and S is a separator.

It is easy to verify (using standard techniques, e.g., [2]) that for every graph G, and every separator S in G, the treewidth of G is at most the maximum over all components Z of G − S of the treewidth of G[S ∪ Z] + clique(S). We call a separator S safe for treewidth (or, in short, safe) when the treewidth of G equals the maximum over all components Z of G − S of the treewidth of G[S ∪ Z] + clique(S).

By a reformulation of known results, e.g., from [19], clique separators are safe for treewidth (note that for every component Z of G − S, G[S ∪ Z] + clique(S) = G[S ∪ Z] is a subgraph of G and hence its treewidth is at most the treewidth of G).

In the next section, we give more general sufficient conditions for separators to be safe. By this, classes of safe separators are defined. If a graph cannot be decomposed any further by a class of safe separators, we call this decomposition final.

For one such class of safe separators, we need to define an almost clique to be a set S ⊆ V for which there exists a vertex v ∈ S such that S − v induces a clique. We call v the non-clique vertex of almost clique S. S is an almost clique separator when S is an almost clique and S is a separator. S is a minimal almost clique separator when S is an almost clique and S is a minimal separator.

Finally, we define a graph H = (V_H, E_H) to be a labelled minor of G = (V_G, E_G) when H can be obtained from G by a sequence of zero or more of the following operations: deletion of edges, deletion of vertices (and all adjacent edges), and edge contraction that keeps the label of one endpoint: when contracting the edge {v, w}, the resulting vertex will be named either v or w. As for minors, the treewidth of a labelled minor H of G is at most the treewidth of G.

3 Conditions for safeness
In this section, we give a number of sufficient conditions for separators to be safe.

LEMMA 3.1. Suppose S is a separator in G = (V, E). Suppose for every component Z of G − S, the graph G − Z contains a clique on S as a labelled minor. Then S is safe for treewidth.

Proof. By the definition of safeness, it remains to show that the treewidth of G is at least the maximum over all components Z of G − S of the treewidth of G[S ∪ Z] + clique(S), i.e., that for every component Z of G − S, the treewidth of G[S ∪ Z] + clique(S) is at most the treewidth of G. Consider a component Z. From the fact that G − Z contains a clique on S as a labelled minor, it follows that G has G[S ∪ Z] + clique(S) as a labelled minor: when applying the operations that yield a clique on S from G − Z to G, we obtain G[S ∪ Z] + clique(S). Since the treewidth does not increase by taking labelled minors, the treewidth of G[S ∪ Z] + clique(S) is at most the treewidth of G, and the lemma follows.

COROLLARY 3.1. Suppose S is a separator in G = (V, E). Suppose for every component Z of G − S, the graph G[Z ∪ S] contains a clique on S as a labelled minor. Then S is safe for treewidth.

Proof. Let S be a separator, and suppose that for every component Z of G − S, the graph G[Z ∪ S] contains a clique on S as labelled minor. Consider a component Z' of G − S. Let Z'' be another component of G − S. The graph G − Z' contains G[S ∪ Z''] as a subgraph, and hence contains a clique on S as labelled minor. The result now follows from Lemma 3.1.
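The way safe separators are used in the preprocessing follows directly from the definitions above: once S is known to be safe, G is replaced by the graphs G[S ∪ Z] + clique(S), one per component Z of G − S. The sketch below performs exactly that split on a dictionary-of-sets adjacency structure; it is an illustration of the step, not the C++ code used later in the experiments.

```python
def split_on_separator(adj, S):
    """Given adjacency sets and a (safe) separator S, return the list of
    subgraphs G[S + Z] + clique(S), one per component Z of G - S."""
    S = set(S)
    rest = set(adj) - S
    parts, seen = [], set()
    for start in rest:
        if start in seen:
            continue
        comp, stack = set(), [start]          # one connected component of G - S
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            seen.add(v)
            stack.extend(w for w in adj[v] if w in rest and w not in comp)
        verts = comp | S                       # build G[S + Z] with S turned into a clique
        sub = {v: {w for w in adj[v] if w in verts} for v in verts}
        for u in S:
            sub[u] |= S - {u}
        parts.append(sub)
    return parts   # treewidth(G) = max over parts, by safeness
```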
THEOREM 3.1. If S is a minimal almost clique separator of G, then S is safe for treewidth.

Proof. Consider a minimal almost clique separator S in G and let v be the non-clique vertex. We show that for every component Z of G − S, the graph G[Z ∪ S] contains a clique on S as a labelled minor; the theorem then follows from Corollary 3.1. Consider a component Z of G − S, and consider the graph G[Z ∪ S]. Since S is a minimal separator, v is adjacent to a vertex in Z. Hence, we can contract all vertices in Z to v. The resulting graph G' has vertex set S, and is a clique: for w, x ∈ S − {v}, {w, x} is an edge in G and hence in G'; and for w ∈ S − {v}, w is adjacent to a vertex y ∈ Z in G (since S is a minimal separator), and hence after the contractions there is an edge {v, w} in G'.

In the next section, we see that there is a polynomial time algorithm to find a minimal separator that is an almost clique in a graph G when such a separator exists. Thus, Theorem 3.1 gives our first new method to preprocess the graph with safe separators. We also can establish safeness of some separators of small size.

COROLLARY 3.2. Every separator of size 1 is safe for treewidth. Every minimal separator of size 2 is safe for treewidth.

Proof. A vertex set of size 1 is a clique; a vertex set of size 2 is a clique or almost clique.

We now consider separators of size three. We first need the following lemma.

LEMMA 3.2. Let S be a minimum separator of G with |S| = 3, and let W be the vertex set of a connected component of G − S with |W| ≥ 2. Then G[W ∪ S] contains a clique on S as labelled minor.

Proof. (The following short proof of this lemma is due to Gašper Fijavž.) First, we show that G[W ∪ S] contains a cycle C. Suppose G[W ∪ S] is a forest. The vertices in W cannot have degree less than three, as G does not have separators of size one or two. Thus the forest G[W ∪ S] has at least two vertices of degree at least three, so it has at least four leaves. As only the vertices in S can be a leaf and |S| = 3, this is a contradiction. Now, consider cycle C, and take three arbitrary vertices w, x, y on C. As the minimum separators in G have size three, there are three vertex-disjoint paths in G from w, x, y to S, by the Menger theorem [18]. As S is a separator in G, these paths belong to G[W ∪ S]. We can now contract all edges on each of these three paths, and edges on C, until we obtain a clique on S. So, a clique on S is a labelled minor of G[W ∪ S]. See Figure 1.

Figure 1: Illustration to the proof of Lemma 3.2

THEOREM 3.2. Let S be a minimum separator of size three in G. Suppose G − S has two connected components, each with at least two vertices. Then S is safe for treewidth.

Proof. The theorem directly follows from Lemma 3.2 and Corollary 3.1.

Other safe separators of size three are implied by the following lemma when k = 3.

LEMMA 3.3. Let S be a minimal separator in G of size k, such that G − S has at least k connected components. Then S is safe for treewidth.
Proof. Consider a component Z of G − S. Let Z_1, ..., Z_{k−1} be the components in (G − Z) − S, and let S = {v_1, ..., v_k}. Since S is a minimal separator, for each i, 1 ≤ i ≤ k − 1, v_i is adjacent to a vertex in Z_i, and we can contract Z_i to the vertex v_i. The result will be a clique on S, as for each i, j, 1 ≤ i < j ≤ k, v_j is adjacent to a vertex in Z_i, hence v_i is adjacent to v_j after the contraction of Z_i to v_i. Safeness now follows by Lemma 3.1.

Together with the almost clique separators, these results are quite powerful. The only case where a separator of size three is not necessarily safe is when it is the neighbourhood of a vertex of degree three, it splits the graph into two components, and the vertices in the separator are not adjacent. For separators of size four, we can derive some cases where they are safe as well, but in this case the results become less powerful.

LEMMA 3.4. Let G have no separator of size at most two, and let S = {v, w, x, y} be a separator of size four, with v adjacent to w, x, and y. Suppose that G has no separator of size three that contains v, and suppose that every connected component of G − S has at least two vertices. Then S is safe for treewidth.
Proof. Consider G − v. {w, x, y} is a separator in G − v, and G − v has no separator of size 1 or 2. Hence, with the proof of Theorem 3.2, we have that for every connected component Z of G − S = G − v − {w, x, y}, a clique on {w, x, y} is a labelled minor of G[{w, x, y} ∪ Z]. As v is adjacent to w, x, and y, we hence have that a clique on S is a labelled minor of G[Z ∪ S]. Hence, S is safe.

4 Finding Safe Separators
We now discuss how to find safe separators for the several types of separators discussed above. Some types of separators are easy to handle. Using safe separators of size zero means splitting the graph into its connected components. The use of separators of size one corresponds to splitting the graph into its biconnected components; this can be done in O(n + m) time using depth first search [21]. Splitting a graph into its triconnected components, and finding the separators of size two, can also be done in linear time [13]. Clique separators are well studied and can be found in O(nm) time, see [22, 17, 5, 19]. The other types of safe separators require somewhat more discussion here. We first look at the minimal almost clique separators.

LEMMA 4.1. Suppose G = (V, E) does not contain a clique separator. For every v ∈ V and S ⊆ V − {v}, S is a minimal clique separator in G − {v} if and only if S ∪ {v} is a minimal almost clique separator in G.

Proof. Clearly S is a clique separator in G − {v} if and only if S ∪ {v} is an almost clique separator in G. Now, suppose W ⊂ S ∪ {v} and S ∪ {v} are two different almost clique separators in G. If v ∈ W, then W − {v} is a separator in G − {v}, and thus S is not a minimal separator in G − {v}. If v ∉ W, then W is a clique, and hence G contains a clique separator, contradiction. Suppose X ⊂ S and S are two different clique separators in G − {v}. Then X ∪ {v} is an almost clique separator in G, and hence S ∪ {v} is not a minimal separator in G.

The lemma tells us that we can find the set of minimal almost clique separators of G = (V, E) by finding the minimal clique separators in G − {v} for all v ∈ V.

COROLLARY 4.1. The set of all minimal almost clique separators of a graph G = (V, E) can be found in O(n^2 m) time.

After we have split a graph in the collection on a minimal almost clique separator, we should check the resulting graphs again for having a minimal almost clique separator. Consider the graph in Figure 2. For
vertex v, there is no separator S such that S − {v} is a clique. However, after the graph has been split on separator {w, x, y}, with {w, x, y} turned into a clique, the component with v contains a minimal separator S' = {v, w, x} with S' − {v} a clique.

Figure 2: New minimal almost clique separators can be formed

We now look at algorithms to find the safe separators of size three, indicated by Theorem 3.2. There is an O(n^2) algorithm to split a graph into its four-connected components and to find all separators of size three [14]. We conjecture that this may lead to an O(n^2) time algorithm to make a safe separator decomposition of a given graph that is final with respect to safe separators of size one, of size two, and minimum separators of size three whose components each contain at least two vertices. In our experiments, we have used a simpler method, based on the vertex connectivity algorithm described in [12, Section 6.2]. We also just find one safe separator, and repeat the procedure all over again on the new graphs in the collection.

Using flow techniques, we can find for a pair of vertices v, w whether there is a separator of size at most k that separates v from w in O((n + m)k) time. This can be used to check whether there is a separator S of size k that separates v from w, such that both v and w belong to a component of G − S with at least two vertices, in O((n + m)k^3) time, as follows. Of course, when v and w are adjacent, then no separator between v and w exists. Suppose both v and w have degree at most k. Then we look at all O(k^2) graphs obtained by contracting v to a neighbour and contracting w to a neighbour, for all pairs of neighbours, excluding contractions to a vertex that is adjacent to both v and w. One can see that the required separator in G exists if and only if there is a separator of size at most k in one of these graphs obtained by contraction: if S separates v from w in the graph obtained by contracting v to x and contracting w to y, then in G, S also separates v from w, and x belongs to the same component as v in G − S, and w and y also belong to the same component in G − S. Thus, we look at all O(k^2) graphs obtained by the contractions for separators between v and w of size at most k. (Actually, when the separators are found using an application of the Ford-Fulkerson flow algorithm, we can see that it is not necessary to do the contractions to w; we omit here the technical details.) When both v and w have degree at least k + 1, then for every separator S of size at most k that separates v from w, both v and w have a neighbour that does not belong to S and hence belong to a component of G − S of size at least two. So, in this case, we just look for a separator of size at most k between v and w. If one of v or w has degree at most k and the other has not, then we look at the O(k) graphs obtained by contracting the small degree vertex to one of its neighbours.
Thus, in O((n + m)k^3) time, we can determine if the desired separator between v and w exists, and if so, find one. We will apply this procedure for the case that k = 3.

So far, we required the separator to separate a specific pair of vertices. To look for any separator in the graph, we use the scheme of [12, p. 129]. Take an arbitrary vertex v. For all vertices w, look if there is a separator of size at most k, with v and w in different components of size at least two. If we find such a separator, we are done. If not, when the degree of v is at most k, check if the neighbours of v split G into at least three components with two of them having size at least two. Otherwise, we know that v must belong to any separator of size at most k that splits G with at least two components of size at least two. Remove v from G, and look for a separator of size at most k − 1 in G − v with at least two components of size at least two. This procedure takes O(nmk^4) time; we apply it with k = 3, so we have an O(nm) procedure to check if G has a safe separator indicated by Theorem 3.2.

Minimum separators that split the graph into at least three components are also safe (Lemma 3.3). Most of these are already found by the procedure above; the remaining case is when there are two vertices of degree three with the same set of neighbours. We can determine in O(n) time if there are two vertices of degree three with the same neighbourhood: assume some order on the vertices. Then, list all vertices of degree three with their neighbours in sorted order, and then radix sort this list (see e.g., [10, Section 9.3]). Vertices with the same neighbourhood will be on consecutive places in this list.
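The duplicate-neighbourhood test in the last paragraph amounts to sorting the neighbour lists of the degree-three vertices. The sketch below replaces the O(n) radix sort by an ordinary comparison sort for brevity and assumes vertex labels are sortable; it is an illustration only.

```python
def degree3_twins(adj):
    """Return a pair of degree-three vertices with identical neighbourhoods,
    or None if no such pair exists.  adj maps vertices to sets of neighbours."""
    keys = sorted((tuple(sorted(adj[v])), v) for v in adj if len(adj[v]) == 3)
    for (nbrs_a, a), (nbrs_b, b) in zip(keys, keys[1:]):
        if nbrs_a == nbrs_b:          # identical sorted neighbour lists sit next to each other
            return a, b
    return None
```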
Thus, in O(nm) time, we find for a given graph G whether it has a safe separator of size three of the types given in Theorem 3.2 or Lemma 3.3. Repeating this on newly formed graphs in the collection gives a (conservative) time bound of O(n^2 m) to find all safe separators of size three.

5 Experiments
In this section, we report on computational experiments that illustrate the significance of safe separators for treewidth. On the one hand, we show that safe separators indeed decompose graphs. On the other hand, we show that the computation times for algorithms to compute small widths are reduced significantly this way. All algorithms have been implemented in C++. Computations have been carried out on a Linux-operated PC with a 2.53 GHz Intel Pentium 4 processor.

The safe separators are tested on two sets of instances. The first set consists of the moralised graphs of probabilistic networks. The second set is taken from the CALMA project on frequency assignment problems [1]. In total, 15 graphs from probabilistic networks and 25 from frequency assignment are considered in this study. Before applying the safe separators, the graphs are preprocessed by the graph reduction rules presented in [8]. In this way, we avoid the detection of trivial separators, i.e., separators that can also be interpreted as one of the graph reduction rules. Such separators generate lots of small graphs that can be neglected anyway.

Our first experiment concerns the decomposition by clique separators. Table 1 shows the results for those graphs that contain clique separators. The 15
instance     | |V| | |E|  | #CS | Sizes of output graphs: #vertices (#subgraphs) | CPU time
munin2-pp    | 167 | 455  | 6   | 95(1), 18(2), 17(2), 8(2)                      | 0.05
munin3-pp    | 96  | 313  | 2   | 82(1), 9(2)                                    | 0.02
munin4-pp    | 217 | 646  | 2   | 177(1), 23(2)                                  | 0.07
munin-kgo-pp | 16  | 41   | 1   | 9(2)                                           | 0.00
celar01-pp   | 157 | 804  | 1   | 110(1), 47(1)                                  | 0.03
celar03-pp   | 81  | 413  | 5   | 63(1), 10(1), 7(1), 5(1), 4(1), 2(1)           | 0.03
celar07-pp   | 92  | 521  | 3   | 71(1), 16(1), 8(1), 3(1)                       | 0.03
celar08-pp   | 189 | 1016 | 3   | 120(1), 53(1), 16(1), 8(1)                     | 0.07
celar09-pp   | 133 | 646  | 1   | 120(1), 16(1)                                  | 0.05
celar10-pp   | 133 | 646  | 1   | 120(1), 16(1)                                  | 0.05
celar11-pp   | 96  | 470  | 1   | 80(1), 19(1)                                   | 0.03

Table 1: The effect of clique separators
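For completeness, the structural tests behind the #CS and #ACS columns reduce to a clique check: a minimal separator is accepted when it has size at most two, is a clique, or becomes a clique after deleting one vertex (Corollary 3.2 and Theorem 3.1). A minimal sketch on a dictionary-of-sets adjacency, with minimality and the separator property assumed rather than verified:

```python
def is_clique(adj, S):
    """True if every pair of vertices in S is adjacent."""
    S = list(S)
    return all(v in adj[u] for i, u in enumerate(S) for v in S[i + 1:])

def separator_is_safe_by_structure(adj, S):
    """Sufficient conditions for a *minimal* separator S to be safe:
    size at most two, a clique, or an almost clique (one non-clique vertex)."""
    S = set(S)
    return (len(S) <= 2
            or is_clique(adj, S)
            or any(is_clique(adj, S - {v}) for v in S))
```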
graphs not reported on do not contain clique separators. The column "#CS" reports the number of clique separators applied, whereas "Sizes of output graphs" reports the number of vertices in the graphs obtained by the decomposition, with in parentheses the number of graphs of that size. For example, the graph munin3-pp is decomposed into one graph of 82 vertices and 2 of 9 vertices. Note that by each decomposition the total number of vertices (over all graphs) increases by at least one, as some vertices belong to more than one output graph. Table 1 shows that sometimes large components are decomposed from the rest of the graph, thereby reducing the size of the largest component significantly. The computation times (in seconds) for this separation are very small.

In our second experiment we extended the computation with the minimal almost-clique separators and the safe separators of size three (in this order). In Table 2 we report on all instances, regardless of whether the separators decompose the graph or not. The columns "#ACS" and "#S3" report respectively the number of almost clique separators and separators of size three that were found. Table 2 shows that both the almost-clique separators and the safe separators of size three are effective in preprocessing the graph. In particular, the minimal almost clique separators turn out to be very effective. Separators of size 3 are found rarely, but in some cases they indeed exist. For some instances (pathfinder-pp, celar06-pp) the largest output graph is a clique, and by this the treewidth is found. On the other hand, on four instances the safe separators did not have any effect. Except for the instance pignet2-pp, the computation times are reasonable.

Our last experiment orients towards the approximation of treewidth for those instances that are not solved
by safe separator decomposition. For this purpose we implemented the maximum cardinality search-minimal algorithm [4] which can be used to generate an upper bound for the treewidth. In Table 3 we compare the values and computation times of this heuristic for the original graphs, the graphs preprocessed by the graph reduction rules, and the graphs decomposed by safe separators. Moreover, we report on the lower bound provided by the graph reduction rules [8] and the one that results from the safe separator decomposition, given by the largest output graph that forms a clique. The results show that an additional significant time reduction can be achieved by the safe separators. In addition, better widths and better lower bounds are derived occasionally. Most remarkable in this context is the instance diabetes, where the width is reduced from 35 via 20 to 4, the treewidth for this instance. 6 Conclusions In this paper, we introduced the notion of separators that are safe for treewidth. It was known that clique separators are safe, in our terminology. We have established a number of sufficient conditions for separators to be safe. Experiments show that such safe separators can be efficiently found, and help to reduce the problem size when we want to compute the treewidth and optimal tree decompositions for many graphs coming from practical applications. Thus, safe separators are a useful tool when preprocessing graphs for treewidth. In an earlier paper, graph reduction was used for preprocessing [8]. A closer look to the reduction rules shows that most can be obtained as a special case of applying safe separators. Safe separators thus are a more powerful tool, as there are many graphs that cannot be reduced with the reduction rules, but contain
75
Size
instance
Output #ACS #S3 7 0 85 0 0 0 2 0 4 13 2 2 0 4 2 0 0 0 1 0 1 0 0 5 0 0 0 1 2 0 0 1 1 19
\E\
#cs
26 116 308 66 167 96 217 16 14 23 27 12 1024 48 30 22 157
78 276 1158 188 455 313 646 41 75 54 63 43 3774 137 77 96 804
0 0 0 0 6 2 3 1 0 0 0 0 0 0 0 0 2
celar02-pp celar03-pp
19 81
115 413
0 5
0 17
0 0
celar04-pp
114
524
0
22
1
celar05-pp celar06-pp celar07-pp
80 16 92
426 101 521
0 0 4
13 1 12
0 0 0
celar08-pp
189
1016
4
32
1
celar09-pp celarlO-pp celarll-pp graphOl-pp graph02-pp graph03-pp graph04-pp graph05-pp graph06-pp graph07-pp graph08-pp graph09-pp graph 10- pp graph 11-pp graph!2-pp graph 13-pp graph 14- pp
133 133 96 89 179 79 179 91 180 180 314 405 328 307 312 420 395
646 646 470 332 659 293 678 394 790 790 1173 1525 1253 1338 1177 1772 1325
2 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
24 24 8 4 3 8 6 4 3 3 18 7 22 18 64 44 0
1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
\v\
bar ley- pp diabetes-pp link-pp muninl-pp munin2-pp munin3-pp munin4-pp munin-kgo-pp oesoca+-pp oow-trad-pp oow-solo-pp pathfinder-pp pignet2-pp pigs-pp ship-ship-pp water-pp celarOl-pp
Sizes of output graphs ^vertices (# subgraphs) 16(1), 6(3), 5(3), 4(1) 8(1), 6(1), 5(84) 63(1), 5(2) 18(2), 17(4), 16(4), 6(2), 5(10), 4(2) 79(1), 7(2), 5(4) 55(2), 38(2), 23(2), 5(2) 7(2), 5(2) 21(1), 5(1) 16(1), 14(1) 7(5), 6(1) 47(1), 6(1) 24(1), 6(2) 21(1), 7(1) 58(1), 47(1), 19(1), 9(1), 8(3), 7(5), 6(2), 5(6), 4(3)
0.03 38(1), 11(1), 10(2), 9(2), 8(1), 7(1), 6(2), 0.74 5(7), 4(4), 3(1), 2(1) 62(1), 16(1), 9(2), 8(3), 7(1), 6(4), 5(7), 4(4), 6.92 3(1) 47(1), 19(1), 9(1), 8(1), 6(2), 5(4), 4(4) 1.18 12(2) 0.01 45(1), 16(1), 12(2), 7(2), 6(6), 5(2), 4(1), 1.10 3(2) 76(1), 39(1), 16(1), 12(3), 10(2), 8(2), 7(1), 6.02 6(2), 5(19), 4(6) 76(1), 16(1), 12(2), 6(1), 5(17), 4(6) 5.22 76(1), 16(1), 12(2), 6(1), 5(17), 4(6) 5.22 48(1), 19(1), 16(1), 13(1), 5(3), 4(4) 3.11 85(1), 10(4) 5.68 176(1), 8(3) 71.63 71(1), 7(8) 5.63 173(1), 8(6) 92.53 87(1), 10(4) 6.56 177(1), 10(3). 63.62 177(1), 10(3) 63.54 296(1), 9(18) 1960.48 398(1), 10(7) 1942.06 306(1), 8(11), 7(5), 6(6) 1859.36 289(1), 9(18) 1896.00 248(1), 7(64) 3619.29 12381.94 376(1), 8(44) 410.75
Table 2: Clique, almost clique, and size three separators for preprocessed instances
76
CPU time 0.06 5.37 36.60 0.80 0.54 1.17 1.65 0.01 0.01 0.07 0.07 0.01 3824.53 0.39 0.18 0.08 5.84
safe separators. However, the algorithms for applying graph reduction are much faster than those for finding safe separators, and thus the best practice seems to be to first apply graph reduction until this is not possible, and then look for safe separators in the graph. An open problem is to obtain faster algorithms to find the safe separators, especially the minimal almost clique separators and/or safe separators of size three (or four), and to find safe separator decompositions that are final for the given types of safe separators.
References [1] K. I. Aardal, C. A. J. Hurkens, J. K. Lenstra, and S. R. Tiourine. Algorithms for radio link frequency assignment: The CALMA project. Operations Research, 50(6):968 - 980, 2003. [2] S. Arnborg, D. G. Cornell, and A. Proskurowski. Complexity of finding embeddings in a fc-tree. SI AM J. Alg. Disc. Meth., 8:277-284, 1987. [3] S. Arnborg, J. Lagergren, and D. Seese. Easy problems for tree-decomposable graphs. J. Algorithms, 12:308340, 1991. [4] A. Berry, J. R. S. Blair, and P. Heggernes. Maximum cardinality search for computing minimal triangulations. In P. Widmayer, editor, Proceedings 28th Int. Workshop on Graph Theoretic Concepts in Computer Science, WG'02, pages 1-12. Springer Verlag, Lecture Notes in Computer Science, vol. 2573, 2002. [5] A. Bery and J.-P. Bordat. Decomposition by clique minimal separators. Research report, LIM, Marseiller, 1997. [6] H. L. Bodlaender. A linear time algorithm for finding tree-decompositions of small treewidth. SIAM J. Comput., 25:1305-1317, 1996. [7] H. L. Bodlaender. Treewidth: Algorithmic techniques and results. In I. Privara and P. Ruzicka, editors, Proceedings 22nd International Symposium on Mathematical Foundations of Computer Science, MFCS'97, Lecture Notes in Computer Science, volume 1295, pages 19-36, Berlin, 1997. Springer-Verlag. [8] H. L. Bodlaender, A. M. C. A. Koster, F. van den E5jkhof, and L. C. van der Gaag. Pre-processing for triangulation of probabilistic networks. In J. Breese and D. Koller, editors, Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 32-39, San Francisco, 2001. Morgan Kaufmann. [9] R. B. Borie, R. G. Parker, and C. A. Tovey. Automatic generation of linear-time algorithms from predicate calculus descriptions of problems on recursively constructed graph families. Algorithmica, 7:555-581, 1992. [10] T. H. Gormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, Mass., USA, 1989. [11] B. Courcelle and M. Mosbah. Monadic second-order
[12] [13] [14] [15] [16]
[17] [18] [19]
[20] [21] [22]
evaluations on tree-decomposable graphs. Theor. Comp. Sc., 109:49-82, 1993. S. Even. Graph Algorithms. Pitman, London, 1979. J. E. Hopcroft and R. E. Tarjan. Dividing a graph into triconnected components. SIAM J. Comput., 2:135158, 1973. A. Kanevsky and Y Ramachandran. Improved algorithms for graph four-connectivity. J. Comp. Syst. Sc., 42:288-306, 1991. A. M. C. A. Koster, S. P. M. van Hoesel, and A. W. J. Kolen. Solving partial constraint satisfaction problems with tree decomposition. Networks, 40:170-180, 2002. S. J. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. The Journal of the Royal Statistical Society. Series B (Methodological), 50:157-224, 1988. H.-G. Leimer. Optimal decomposition by clique separators. Disc. Math., 113:99-123, 1993. K. Menger. 2ir allgemeinen kurventheorie. Fund. Math., 10:96-115, 1927. K. G. Olesen and A. L. Madsen. Maximal prime subgraph decomposition of Bayesian networks. Technical report, Department of Computer Science, Aalborg University, Aalborg, Denmark, 1999. H. Rohrig. Tree decomposition: A feasibility study. Master's thesis, Max-Planck-Institut fur Informatik, Saarbriicken, Germany, 1998. R. E. Tarjan. Depth first search and linear graph algorithms. SIAM J. Comput., 1:146-160, 1972. R. E. Tarjan. Decomposition by clique separators. Disc. Math., 55:221-232, 1985.
77
instance barley diabetes link muninl munin2 muninS munin4 munin-kgo oesoca+ oow-trad oow-solo pathfinder pignet2 pigs ship-ship water celarOl celar02 celarOS celar04 celarOS celar06 celarO? celarOS celarOQ celarlO celarll graphOl graph02 graphOS graph04 graphOS graphOG graphO? graphOS graph09 graphlO graph 11 graph!2 graph 13 graph!4
Original width CPU time 0.16 7 182.77 35 37 857.43 8.27 15 444.31 16 662.03 15 431.79 28 335.74 13 11 0.31 0.07 6 0.12 6 0.43 6 77479.39 239 50.88 18 0.18 9 0.06 10 17 71.73 1.00 10 16 9.11 41.65 16 15 8.78 11 1.16 10.21 18 81.76 19 18 48.23 49.35 18 40.29 16 4.12 27 43.76 57 3.72 24 45.17 61 4.42 28 60 47.06 47.21 60 104 276.65 706.53 128 105 292.78 295.77 106 267.72 99 146 817.39 770.40 145
low 4 4 4 4 4 4 4 5 9 4 4 5 4 4 4 5 6 6 8 6 6 9 9 9 9 9 7 8 6 5 6 8 8 8 7 8 4 7 5 6 4
with Graph Red. width CPU time 7 0.04 20 5.00 36 117.83 13 0.89 8 4.29 13 2.49 15 12.05 5 0.01 11 0.01 0.04 6 6 0.05 6 0.00 230 9758.43 11 0.28 9 0.08 0.04 10 17 7.31 10 0.01 16 1.15 16 4.04 1.44 15 11 0.01 1.91 18 11.31 19 6.34 19 6.34 19 16 2.26 27 3.17 34.92 55 25 2.26 35.94 59 29 3.60 37.72 58 37.71 58 103 234.95 128 551.76 274.94 105 104 239.45 227.94 99 143 681.38 139 547.31
low 5 4 4 4 4 4 4 5 9 4 4 6 4 5 4 6 8 6 10 8 8 11 11 11 11 11 7 9 7 6 7 9 9 9 8 9 7 8 6 7 4
with Safe Sep. width CPU time 7 0.01 4 0.00 36 117.83 13 0.83 7 0.13 13 1.60 11 1.13 5 0.00 11 0.01 6 0.03 0.02 6 230 9758.43 11 0.26 0.04 9 10 0.03 1.12 16 10 0.01 16 0.20 16 0.88 16 0.40 18 0.37 18 1.76 18 1.55 18 1.55 16 0.45 27 2.84 56 34.12 23 1.66 58 33.45 30 3.21 58 36.60 58 36.90 103 203.66 125 530.38 105 231.02 104 206.22 94 131.22 140 523.76 139 547.31
Table 3: Maximum Cardinality Search-Minimal for instances
78
Efficient Implementation of a Hotlink Assignment Algorithm for Web Sites Artur Alves Pessoa t
Eduardo Sany Laber *
Abstract Let T be a rooted directed tree where nodes represent web pages of a web site and arcs represent hyperlinks. In this case, when a user searches for an information t, it traverses a directed path in T, from the root node to the node that contains i. In this context, we define hotlinks as additional hyperlinks added to web pages in order to reduce the number of accessed pages per search. Given a search probability for each web page, we address the problem of inserting at most one hotlink per page, minimizing the expected number of accesses in a search. In a previous work, we proposed a dynamic programming polynomial time algorithm for this problem, assuming that the height of T is logarithmic. In this paper, we present an efficient implementation for the previous algorithm that leads to optimal or quasi-optimal hotlinks assignments. We also describe experiments with 1,914 trees extracted from 21 actual web sites and randomly generated access probabilities. Our implementation has found optimal solutions to all but two generated instances in up to five minutes each, using a standard computer. The largest solved instance has 57,877 nodes.
1 Introduction Due the expansion of the Internet at unprecedented rates, continuing efforts are being made in order to improve its performance. An important approach is improving the design of web sites [4, 11]. A web site can be viewed as a directed graph where nodes represent web pages and arcs represent hyperlinks. In this case, the node that corresponds to the home page is a root node. Hence, when a user searches for an information i in a web site, it traverses a directed path in the corresponding graph, from the root node to the node that contains i. Here, we assume that the user always "All authors are from the Informatics Department of PUCRio, Brazil. Address: Rua Marques de Sao Vicente 225, RDC, 4° andar, CEP 22453-900, Rio de Janeiro - RJ, Brazil. E-mails: {artur,laber,criston}@inf.puc-rio.br tThis work was partially supported by CNPq through Bolsa DTI (Proj. 55.2046/2002/7), and through Edital Universal 01/2002 (Proc. 476817/2003-0). *This work was partially supported by CNPq through Bolsa de Produtividade (Proc. 300428/99-5) and through Edital Universal 01/2002 (Proc. 476817/2003-0), and by FAPERJ (Proc. E26/150.715/2003).
Criston de Souza
knows which link leads to the desired information. In this context, we define hotlinks as additional hyperlinks added to web pages in order to reduce the number of accessed pages per search [11]. Since a "nice" web page cannot contain much information, the number of hotlinks inserted in each page should be limited. This scenario motivates the problem of inserting at most one hotlink in each web page, so as to minimize the number of accesses required to locate an information. Given a search probability for each web page, we address the problem of inserting at most one hotlink per page, minimizing the expected number of accesses in a search. We call this problem the Average Case Hotlink Search (ACHS) problem. We assume that the given web site is represented by a rooted directed tree T, where only the leaves contain information to be searched by the user. We also assume that the user always follows a hotlink (w, v) from node u when searching for a leaf in the subtree rooted by v. This assumption was introduced by Czyzowicz et al. [3], and called the "obvious navigation" assumption. Due to this assumption, we consider that the insertion of a hyperlink (w,v) must be followed by the deletion of any other arc that ends in v. As a result, the graph obtained after inserting a hotlink in a tree is also a tree. For example, the tree of Figure l.(b) is obtained from that of Figure l.(a) through the addition of the hotlink (u,v) and then the addition of hotlink (z,iu).
Figure 1: (a) the tree T.
(b) the tree TA, where
1.1 Problem Definition An instance of the ACHS problem is a pair (T,p), where T = (V,E)is a directed tree rooted at a node r € V and p is a probability
79
function for the leaves of T. In order to explain the problem, let us recall some basic definitions. The level d(u) of a node u in a tree T is its distance to the root of T. We say that a node v is a descendant of another node u in T when the only path in T that connects r to v contains u. In this case, we also have that u is ancestor of v. A node u is a proper descendant (ancestor) of v if u is a descendant (ancestor) of v and u^v. Given T = (V, E), a solution to the ACHS problem is a hotlink assignment, defined as a set A C V x V. A hotlink assignment A is feasible if and only if it satisfies the following three conditions:
assignment to (T, p) is a feasible hotlink assignment that minimizes
over all possible feasible assignments. The objective of the ACHS problem is to find an optimal hotlink assignment to T. We use E*[T, p] to denote E[TA* , p]. Observe that some internal nodes of T may become leaves in TA (e.g. the node x in Figure l-(b)). By definition, these nodes do not belong to L. Hence, we refer to the nodes of L as hotleaves. The height of a tree T', denoted by tf(T'), is the maximum distance from the root of T1 to a leaf (i) for every arc (w, v) € A, v is descendant of u in T; in T'. Throughout this paper, we may use H to denote (u) let u, v, a, b be nodes of V such that u is a proper the height H(T) of the input tree T. In Figure l.(b), ancestor of v and v is a proper ancestor of a. If we have H(TA) = 4. (w, a), (u, 6) € ^4, then a is not an ancestor of 6; 1.2 Statement of the Results In this paper, we (iii) for every node u E V, there is at most one arc present EX-PATH (Experimental PATH) an efficient (w,t>) € A implementation for the PATH algorithm. PATH is an The condition (ii) is related to the "obvious nav- exact algorithm for ACHS, independently proposed in igation assumption" and its motivation can be better [12] and [6]. Given a parameter D, PATH produces the all feasible understood by examining Figure 1. The condition pre- best possible hotlink assignment A*D among A hotlink assignments A such that H(T ) < D. This vents, as an example, that both (w,v) and (y,p) belong £> algorithm runs in O(n3 ) time and requires O(2D) to A simultaneously. Recall that we assume that the A user always follows a hotlink (u, v) from node u when memory space. Since H(T * ) < H it suffices to execute searching for a leaf in the subtree rooted at v. Since p is PATH with D = H in order to solve ACHS. Hence, a node in such a subtree, we conclude that the hotlink PATH runs in polynomial time, under the assumption that H(T) = O(logn). It is worth mentioning that EX(y, 6) (if it exists) will never be followed by the user. We remark that the definition of a feasible hotlink PATH introduce effective practical improvements on the assignment allows self loops. Although this is not PATH algorithm. The main motivation for this implementation is our necessary, it helps the description of the algorithms intuition that H(T) typically grows very slowly as n proposed in this paper. For practical purposes, however, increases in actual web sites. This is based on the we never add a self loop. belief that a web designer avoids to construct sites that Now, we formalize the objective for the ACHS obligates an user to traverse a long path in order to problem. For that, we define the concept of an improved locate its target information. In fact, our intuition is tree. confirmed by the experiments reported in this paper. We describe experiments with 1914 trees extracted DEFINITION 1.1. Given T — (V,E), and a feasible from 21 actual web sites (we found 16 as a maximum hotlink assignment A, the improved tree obtained from A value of H). Motivated by the research of [7], the T through A is defined as T = (V, (E-X)UA), where access probabilities were generated according to the X - {(u,v) E E \(y,v) € A for some y € V}. Zipf distribution. EX-PATH has found optimal hotlinks In the definition above, X is the set of arcs of assignments to 1912 generated instances in up to five E whose heads receive some hotlink from A. As an minutes each, using a standard computer. Furthermore, example, Figure l.(a) shows a tree T and Figure l.(b) approximately 98% of them completed in up to 0.3 shows the tree T^u'v^^w^. 
The set X in this case is seconds. The largest solved instance has 57877 nodes. It is worth mentioning that the execution time required {(x,w),(y,v)}. Given an improved tree TA rooted at r, let ^(M) by EX-PATH for a given value of D does not depend be the level of u in TA. Moreover, let L be the on the generated probability distribution. Only the subset of V that contains all leaves of T. Given a quality of the obtained solutions for D < H is affected probability function p : L -> [0,1], an optimal hotlink by these probabilities. In this case, we also observed
80
that EX-PATH has the nice property of producing quasi-optimal solutions when the parameter D is set slightly smaller than JET, which greatly improves its performance. Moreover, EX-PATH introduces two practical improvements:
heuristic was tested with both random and actual instances, where the latter were extracted from the web sites of Canadian universities. In [8], Kranakis et al. present an O(n2) time algorithm to this problem that provides an upper bound on the optimal value of
(i) it discards some subproblems generated by PATH without affecting its optimality; 1.4 Paper Organization This paper is organized as follows. In Section 3, we give a detailed description of (ii) it uses a weaker restriction as a function of the the PATH algorithm as well as a brief description of the parameter D. other algorithms used in our experiments. In Section 4, The first improvement reduces both the execution we introduce the EX-PATH implementation by adding time and the memory usage of EX-PATH. As a result two practical improvements to the PATH algorithm. In of the second improvement, EX-PATH explores a larger Section 5, we describe our experiments and comment number of feasible solutions in order to find the best one, our experimental results. for the same value of D. As a consequence, the quality of the obtained solution is improved without increasing 2 Notations and Definitions the memory usage and with negligible changes on the We use Tu = (Vu,Eu) to denote the subgraph of T execution times. Later, we detail the current D- induced by the descendants of w, that is, Vu = {v € restriction of EX-PATH. V | v is descendant of u}, and Eu = {(v, w) € E \ v, w € Finally, we also compare the solutions obtained by Vu}. T—Tuis used to denote the subgraph of T induced our implementation with three other implementations by V — Vu. Throughout this paper, we refer to Tu as for ACHS algorithms: greedyBFS [3], approorimate- the subtree o/T rooted at u. Furthermore, we use Lu to HotlinkAssignment algorithm [8] and MIP. The latter denote the subset of Vu that contains all leaves of Tu. is the implementation of a Mixed Integer Programming We use Su to denote the set of all children of node u in Model introduced in this paper. As far as we know, we T. are the first to show how the solutions of these algorithms compare to the optimal solutions. 3 Algorithms 1.3 Related Work The idea of adding hotlinks to web sites were firstly proposed by Perkowitz and Etzioni [10]. After that, some authors have considered the ACHS problem and some variations [2, 8, 5]. Bose et al. [2] considered a variation of the ACHS problem where the input tree is replaced by an arbitrary DAG (Directed Acyclic Graph). Even for an uniform distribution, they showed that this problem is MPcomplete. Furthermore, for general trees, they give a where Ent (p) lower bound on optimal value of is the entropy [1] of the access probability distribution p and d is the maximum node outdegree in the tree. This lower bound also holds for the ACHS problem. In [5], Fuhrmann et al. considered another variation of ACHS where multiple hotlinks can be assigned from a single node. For fc-regular complete trees and general distributions, they proved upper and lower bounds on the optimal values. The previous related works do not use the obvious navigation assumption. In [3], Czyzowicz et al. gives a detailed discussion on the ACHS problem and some variants, including practical aspects. The authors also propose a greedy heuristic to the ACHS problem called greedyBFS, which is implemented in a software tool. The greedyBFS
In this section, we explain the algorithms that will be used in our experiments. 3.1 Approximation Algorithms The greedyBFS algorithm was designed by Czyzowicz et al. [3]. The authors defined the gain of a hotlink (u, v) as g(u, v) = p(Tv)(d(v) - d(u) - 1), where p(Tw) is the sum of the probabilities of the leaves in Tv. First, greedyBFS assigns a hotlink from the root r of the input tree to the node u that maximizes the gain g(r, u). This generates a new tree T^r'^. Then, the algorithm recursively assigns hotlinks to the subtrees rooted at the children of rinT< r ' t t >. The approximateHotlinkAssignment algorithm (AHA for shortness) was proposed by Kranakis et al [8]. The only difference between this algorithm and the previous one is the criterion used to choose the node u. AHA chooses a node such that ^^ < p(u) < ^^If such a node does not exist or is a child of r, then AHA chooses the grandchild of r with the greatest probability. 3.2 The Mixed Integer Programming Algorithm In order to model ACHS as a MIP problem, we
81
consider the m paths (Ai,...,A m ) in the input tree T from a hotleaf to the root r. We denote by Ck and pk the length of the path A* and the probability of the only hotleaf in this path, respectively. In our MIP model, we use the binary variable Xjj to indicate whether the hotlink (i, j) belongs to current solution (xij = 1) or not (x^j = 0). We say that two hotlinks (i,j) and (a, 6) are nested when there is at least one path that contains i, j, a, b, for (i £ a) V (6 ^ j), and d(i) < d(a) < d(b) < d(j). Hence, we also use a continuous variable yfj, that represents the length reduction on the path A* due to the addition of the hotlink («,.?). Clearly, if two hotlinks are nested, then only the path length reduction due to the most external hotlink should be non-zero. Moreover, any path length reduction due to the addition of the hotlink (i, j] is not greater than d(i,j) = (d(j) — d(i) — 1). Hence, we have the following two restrictions in our model:
n recursively solve t h e subproblem where t h e input tree is the maximum subtree of T^r>u^ rooted at v. At the end, return the best solution found. In this case, if we discard every solution that generates an improved tree whose height is greater than £>, then we may have Q(nD) subproblems. As stated before, the time complexity of the PATH algorithm is simply exponential on the parameter D. For that, PATH uses the following strategy. For a given input tree rooted at r, it considers only two possibilities: assigning or not a hotlink from r to some node in T/, where / is the last child of r (assuming some order). If such a hotlink is not assigned, then we must solve a subproblem with T/ as an input tree and another subproblem with T — T/ as an input tree. On the other hand, if a hotlink is assigned from r to some node in T/, then we must solve two modified subproblems. In the first subproblem, the input tree T/ has an additional hotlink available from one level lower than its root. The second subproblem has T — T/ as an input tree where no hotlink can be assign from the root r (since it must be assigned to some node in T/). This approach leads to the P-ACHS problem defined below.
We say that two hotlinks (i,j) and (a, 6) cross when Input: there is at least one path that contains i,j,a, 6, and i) a directed path q = (Vq,JE?q) where Vq = d(i) < d(a) < d(j) < d(b). Since the obvious navigation {ft, - - -,9fc} and Eq ~ {(qi,qi+i)\l < i < k - 1}; assumption does not allow crossing hotlinks, we add the following additional restriction to our model for all pair ii) a vector a = (ai,..., a*, a +i, b) € {0,1}*+2; k of crossing hotlinks: Xjj + ar0,& < 1Finally, since we can use at most one hotlink iii) a tree T = (V, 8} rooted at r; from each node, we have the following restriction: iv) an integer D; v) a probability function p. Let L be the set of all leaves in T. As in the ACHS problem, we refer to the nodes of I. as hotleaves. Output: A hotlink assignment A to the tree Tq = (Vq U V, Eq U S U {(tfifc,r)}), satisfying the following six conditions: 3.3 The PATH Algorithm In this section, we explain the PATH algorithm [12, 6], an exact dynamic (a) A is feasible in the sense of the ACHS problem; programming algorithm for solving ACHS. Given a pa- (b) No hotlink can point to a node in K,; rameter D selected by the user, PATH finds a hotlink assignment A*D with the following properties: H(TA») < (c) If a, = 0, then no hotlink can leave &; D and, for every feasible hotlink assignment A with (d) If ak+i = 0, then no hotlink can leave r; H(TA] < D, we have E[TA°,p] < E[TA,p]. Hence, our algorithm solves a height restricted hotlink assign- (e) if b = 0, then no hotlink can point to r. ment problem. Next, we give an overview on the ap- (f) H(TA) < D proach used to design this algorithm. A A straightforward approach to solve the (height Objective: Minimize E[T ,p]. restricted) ACHS problem using dynamic programming Observe that the ACHS problem is a particular case would be the following. For each possible hotlink of the P-ACHS problem when q is empty, 6 = ai = 1, assignment (r,w) from the root r of T, obtain the D = H and T = T. Thus, an exact algorithm for Pimproved tree T^r'u^. Then, for each child v of r ACHS is also an exact algorithm for ACHS. The objective of ACHS is modeled as follows:
82
3.4 Solving P-ACHS Figures 2 and 3 are used throughout this section to illustrate the PATH execution. Figures 2.(a) and 3.(a) represent an instance of PACHS where the path q consists of four nodes <&, #2, u to indicate the path obtained by inserting u at the end of q. We use q* to denote the subpath of q formed by its i-th first nodes, and |q| the denote the number of nodes in q.
Figure 3: (a) an instance of the P-ACHS problem, (b) and (c) the decomposition in the case 2 when c = (0,1,0,0,1). Let E*(q,a,T,p) denote the cost of an optimal solution of a P-ACHS instance defined by a binary vector a, a directed tree T, a path q with |a| — 2 nodes, an integer D, and a probability function p. If |q| > D, then PATH sets J5*(q,a,T,p) = oo. Hence, let us assume that |q| < D. In order to solve this instance we must consider the following cases: Case 1: some hotlink is assigned from a node of q to r in the optimal solution; Case 2: no hotlink is assigned from a node of q to r in the optimal solution; Case 1 This case is only considered when 6 = 1 . In this case, we must add a hotlink from some available Figure 2: (a) an instance of the P-ACHS problem, node to r. Thus, we have $3f=i at possibilities. As (b) the improved tree obtained due to the addition of an example, if ((/2, r) is assigned to the tree of Figure the hotlink (q2,r). (c) the corresponding subproblem 2.(a), then PATH generates the subproblem of Figure generated in the case 1. 2.(c). In fact, the addition of hotlink (52, r) creates an improved tree where qi has two children: r and q3 (see
83
Figure 2.(b)). However, since 93 and 94 are not ancestors of hotleaves, they can be removed without modifying the cost of the solution. Observe that 6 is set to 0 since the condition (ii) of the ACHS problem definition (see Section 1.1) assures that no node can receive two Stop Conditions: If T has only one node and this hotlinks. In general, if some hotlink points to r in the node is a hotleaf, say I, then the best choice is to assign optimal solution, we have that a hotlink from the first available node in q to I. Thus,
Case 2 In this case all the available nodes of q may only point to some node in V — {r}. Thus, PATH must decide which of the available nodes are allowed to point to the nodes of 7/, the maximum subtree of T rooted at the last child / of r (assuming any order). Let k' = Y^,i=i ai De tne number of available nodes. Then, PATH has 2fc/ possibilities to take such a decision. Since it is not clear which one is the best, then all of them are considered. In order to clarify this case, let us consider the subproblem of Figure 3. (a) and the possibility where 92 and r remain available for 7/ (see Figure 3.(c)). As a consequence, only q\ and 54 will be allowed to point to nodes in T — 7/ (Figure 3.(b)). Figure 3.(b) defines a new subproblem (q,a',T — 7/,p), where a' = (1,0,0,1,0,0). Note that b is set to 0 since we are in case 2. On the other hand, Figure 3.(c) defines a new subproblem (q —> r,a",7/,p), where a" = (0,1, 0,0, 1,1,1). Thus, the maximum between the cost of the optimal solutions for the subproblems defined by Figures 3.(b) and 3.(c) is the cost of the optimal solution for the problem of Figure 3. (a) under the assumptions that no hotlink can be assigned to r (Case 2), the nodes qz and r cannot point to nodes in T— 7/, and the nodes q\ and 54 cannot point to nodes in 7/. In general, let C be a set of binary vectors defined by C = {(ci, . . . , cfc+i)|ci < at for t = 1, . . . , k + 1}. Each c € C corresponds to one of the 2*' possibilities for selecting the nodes that will remain available to point to nodes in 7/. Furthermore, let c = a — c. This vector defines which nodes from q will remain available to point to nodes in T — 7/. Then, by considering all choices for c, we have that
(3-3) If T has no hotleaves then .E*(q,a, T,p) = 0. If |q| > D then E*(q,a,T,p) = oo. 4 The EX-PATH Implementation In this section, we discuss two ideas that we implement on the EX-PATH code in order to improve both its performance and its memory consumption with respect to the PATH algorithm.
4.1 Reducing the number of subproblems Let (q,a, T,p) be a subproblem generated by PATH. By means of the dynamic programming technique, PATH maintains a table that contains the optimum values for all generated subproblems. Next, we discuss how a given subproblem can be found in this table. Later, we show how to reduce the size of this table. Recall that every tree T of a generated subproblem can be obtained from some subtree Tr of T by removing the last i children of r (following some arbitrary order). Hence, we use the last non-removed child of r to identify T in the set of all generated trees. We denote this last child by /. For example, in the subproblem of Figure 3.(a), / is the third child of r (from left to right). On the other hand, / is the second child of r for the subproblem of Figure 3.(b). If r has no child, then T does not need to be identified because the corresponding subproblems lead to stop conditions for PATH. Moreover, let a be the binary value of the vector a, given by we also use this value to identify the vector a in the set of all generated vectors. Hence, an element in the subproblem table of PATH can be identified by the pair (a, /). Recall that, when PATH decomposes a subproblem according to the case 2, it checks 2* possible values for is the number a binary vector c, wheree of available hotlinks in both the path q and the root r of T. If T has m hotleaves with m > fc', then EXCases 1 and 2 together: Let P and Q be PATH assumes that the last k' — m available hotlinks respectively, the righthand side of equations (3.1) and of q -> r cannot be assigned. For that, it sets to zero the corresponding elements of a. As a result, only (3.2). Thus,
84
2m possibilities are checked, and many elements hi the subproblem table of PATH do not need to be stored by EX-PATH. The correctness of EX-PATH is stated by the following lemma, whose proof we defer for the extended version of this paper.
by our implementation. Now, we show a further consequence of this improvement. Observe that we do not need to restrict the number of nodes of q as it does not directly affect the size of our subproblem table. Instead, we only discard the subproblems with |q'| > D. As a result, this improvement leads to the best possible soluLEMMA 4.1. Let I = (q,a,T,p) be an instance of P- tion A*D under the restriction that no subproblem with ACHS generated as a subproblem for an input tree T, |q'| > D is used to construct TA*°. Observe that this A where T has m leaves and q -4 r has k' > m available restriction is weaker than H(T o) < D. nodes. Then, there is an optimal assignment for I that 5 Experimental Results assigns only the first m available hotlinks of q. In this section, we describe our experiments. All of them To efficiently store and retrieve the optimum values were executed in a Xeon 2.4 GHz machine with 1GB of for the subproblems generated by EX-PATH, we pro- RAM Memory. The MIP algorithm was implemented pose a new indexing method for this subproblem table. on XPRESS-MP optimization package [9]. Given an instance / = (q, a, T, p) of P-ACHS where T has m hotleaves, let ft be the number of vectors a' with 5.1 Instances In order to obtain our instances, we no more than m available hotlinks and binary values adopted the same approach employed by Czyzowicz smaller than a. If a also has no more than m avail- [3]. First, we obtained the directed graphs associated able hotlinks, then we use the pair (/?, /) to identify / to 21 Brazilian Universities sites. Let Gl,...,G21 be in the subproblem table. For example, we observe that these graphs and let Sj, for* i = 1,..., 21, be the node we have ft = a whenever no vector has more than ra that models the home page of the ith site. Next, for available hotlinks. On the other hand, if m = 2 and i = 1,...,21, we executed a breadth first search in a = (1,0,0,0,0), then a = 16 and 0 = 14 since both G*, starting at Sj, to extract a tree T*. This process the vectors (0,1,1,1,0) and (0,1,1,1,1) have more than has generated 21 main trees. Then, for each tree two available hotlinks and binary values smaller than 16. T' = (¥*,£*) we generated |V*| additional trees by Next, we show how to recursively calculate the value considering each subtree rooted at a node in T*. Finally, of ft as a function of a and m. If a = (0, a'), then we discard the trees with height smaller than 3 so as to /?(a,m) = /?(a',m). Otherwise, if a = (l,a'), then account a total of 1914 trees, including the main ones. /?(a,m) = /3(a',m—l)+7(|a|,m), where 7(|a|,m) is the Table 1 indicates some parameters of the main total number of vectors with |a| elements, no more than trees. All of them have been defined before but d, the m available hotlinks, and ai = 0. The recursion stops maximum node degree in a tree. when |a| = 1. In this case, we always return the value In order to assign probabilities to the leaves, we of &. To improve the performance of this calculation, employed the Zipf's distribution, where the probability we obtain the value of 7(|a|,m) = 2 ££0 ( |a 'r 2 )» for of the ith most probable leaf is given by pi = ^—. |a| = 1,..., D and m = 1,..., |a|—2, from another table Here, Hm is the harmonic number Hm = ]CiLi ^- The usage of such a distribution is motivated by the work that is constructed hi a preprocessing phase. 
of Classman [7], which experimentally proves that the 4.2 Weakening the jD-restriction Let (q,a,T,p) popularity of Web pages can actually be modeled with be an instance of P-ACHS. Let also z — min{i | ai — Zipf's popularity law. We note that the popularity 1} - 1. We observe that J5*(q,a,T,p) = z p(T) + order among the leaves was randomly selected. On the J5*(q',a',7~,p), where a' (q') is obtained from a (q) by other hand, we recall that the execution time of EXremoving its first z elements (nodes), and p(T) denotes PATH for a given value of D does not depend on the the sum of the probabilities of all hotleaves in T- For generated probability distribution. Only the quality of example, if we have a = (0,0,0,1,0,1,1), then z — 3 the obtained solutions for D < H is affected by these and a' = (1,0,1,1). In this case, since no hotlink can probabilities. be assigned from 91,92 or <&, these nodes only cause a constant to be added to the optimum value. Since 5.2 Results For each of the 1914 instances, we exewe can use the value of JB*(q',a',T,p) to calculate cuted the four algorithms: greedyBFS [3], AHA [8], EXE*(q,a,T,p), we need only to store ^(q^a^Tjp) in PATH and MIP. We executed EX-PATH with D = 14. This value is the maximum one for which EX-PATH can the table. execute all instances without memory overflow. Since The immediate consequence for the previous imH > 14 for only two instances, then EX-PATH has provement is to reduce both the time and space needed
85
Site www.ufs.br www.ufrn.br www.pucminas.br www.ufmg.br www.pucsp.br www.unicamp.br www.puc-rio.br www.ucb.br www.ucsal.br www.ufg.br www.ucg.br www.ufmt.br www.ufms.br www.ufpr.br www.pucrs.br www.ufsc.br www.ufba.br www.ufpe.br www.unicap.br www.ufpb.br www.ufpi.br
n 542 320 1016 2275 1151 57877 13315 1158 369 556 2317 746 1743 646 10484 443 1892 842 48 117 1100
m 440 254 921 1863 938 45559 11618 933 219 477 2016 583 1534 554 8225 341 1446 641 35 96 650
d 47 28 81 153 43 358 513 52 43 78 76 84 390 80 426 40 148 30 21 20 48
H 6 7 4 6 7 10 8 7 5 4 6 9 3 5 16 10 9 10 6 4 8
E(T,p] 3,61 5,06 3,00 4,79 5,13 7,69 5,45 5,41 3,97 2,74 4,47 5,62 2,61 3,53 11,63 6,89 5,68 6,95 4,86 3,32 5,57
Table 1: The main instances obtained from Brazilian Universities sites.
Inst 1 2 3 4 5 6 Inst 1 2 3 4 5 6 Inst 1 2 3 4 5 6
D H 16 15 14 13 10 12 D H 16 15 14 13 10 12 D H 16 15 14 13 10 12
H MB % 49,62 313 50,94 108 40,96 50 51,69 37 H -3 MB % 56,06 786 50,78 269 49,62 92 50,94 32 40,96 14 11 51,69 H-6 MB % 56,06 68 24 50,78 49,62 8 50,93 3 39,69 1 51,69 <1
H-l MB % 50,78 903 49,62 309 50,94 108 44 40,96 37 51,69 H-4:
MB % 56,06 387 50,78 132 49,62 45 50,94 16 40,96 6 5 51,69 H-7 MB % 26 56,06 50,78 9 49,62 3 1 50,85 32,22 <1 50,82 <1
H-2 MB % 56,06 1589 50,78 539 49,62 186 50,94 63 40,96 32 51,69 21 H -5 MB % 56,06 178 50,78 61 49,62 20 50,94 7 40,91 3 2 51,69
Table 2: Gain (%) due to the EX-PATH algorithm, found the optimal solution for at least 1912 instances. and execution times (seconds), for D ranging from H The gain of an algorithm Alg for an instance (T, p) to H-7. is given by (E[T,p] - E[TA,p])/E[T,p], where A is the set of hotlinks that Alg adds to the tree T. The leftmost chart in Figure 4 shows the average gain of Table 2 shows the gain for each value of D, as well each algorithm as a function of the input height. On as the corresponding execution times. The bold printed the other hand, the rightmost chart shows how the values are either non-optimal gains or gains for which we solution cost of each algorithm compares to that of do not have an optimality proof as for the first two rows. EX-PATH. Recall that all solutions of EX-PATH are We remark that, without weakening the D-restriction, optimal for H < 14. We observe the solutions of EX-PATH only finds a solution A when H(TA) < D. both the greedyBFS and AHA become farther from the After weakening the D-restriction, however, this is not optimal ones as H increases. This fact is less clear for necessarily true. For example, for the instance 4, the H > 10 because we have few instances in this range. solution found for D = H — 4 = 9 leads to an optimal We also observe that, as in [3], the solutions given tree whose height is 12 (H(TA*) = 12). As a second by greedyBFS are better than that of AHA. However, example, for the fifth instance, the solution found for we give the additional information that the greedyBFS D = H — 6 = 4 leads to a tree whose height is solutions are at most 5% from the optimal ones (in the 10. Despite of the very restrictive value of D, this average), for H < 10. last tree has a gain of 39,69% against an optimum In terms of speed, both greedyBFS and AHA solved gain of 40,96%. Although, greedyBFS has found good all instances in less than 0.3 seconds. XPRESS exactly solutions for all instances, only the last gain of the fifth solved 1681 instances spending up to 1 minute, 18 of the row is worse than that of greedyBFS. instances took 1 to 2 minutes time and 215 instances Finally, Table 3 shows the allocated memory and did not finish in less than 1 hour. EX-PATH exactly the relative reduction on the memory usage due to the solved 1880 instances in up to 0.3 seconds, 28 in up to first improvement of section 4. Roughly speaking, we 10 seconds and the remaining 6 (for 2 of them we do observe that the memory usage divides by two whenever not have an optimality certificate) under 27 minutes. the value of D decreases by one. Moreover, the first For these 6 instances we gradually reduced the value improvement provided a reduction on the memory usage of D so as to observe both the performance gain and of about 50% for large values of D. This reduction the solution quality loss. Tables 2 and 3 illustrate the deteriorates as the value of D decreases. performance of EX-PATH for these instances.
86
D
Inst 1 2 3 4 5 6 Inst 1 2 3 4 5 6
H
16 15 14 13 10 12 D H
16 15 14 13 10 12
H
H-l
MB % MB % - 600 54 561 57 314 52 295 55 165 49 233 48 138 39 74 46 130 53 H-4 H-5 MB % MB % 197 40 108 34 103 37 57 31 54 34 30 27 28 30 16 23 25 11 13 5 7 17 13 25
H-2
MB 637 332 173 91 84 41
%
51 49 47 44 26 40
H -6
MB 59 31 16 8 7 4
% 28 25 21 17 0 10
H-3 MB % 355 46 185 43 97 41 51 38 46 18 23 33 H -7
MB 32 17 9 5 4 2
% 23 19 15 10 0 3
Table 3: Allocated memory (MB) by the EX-PATH algorithm, and the relative reduction obtained through the first improvement, for D ranging from H to H — 7.
Figure 4: Average gain (%) due to each algorithm as a function of If (leftmost), and relative difference between the solution cost of each algorithm and that of EXPATH (rightmost). References [1] N. Abramson. Information Theory and Coding. McGraw Hill, 1963. [2] Prosenjit Bose, Evangelos Kranakis, Danny Krizanc, Miguel Vargas Martin, Jurek Czyzowicz, Andrzej Pelc, and Leszek Gasieniec. Strategies for hotlink assignments. In International Symposium on Algorithms and Computation, pages 23-34, 2000. [3] J. Czyzowicz, E. Kranakis, D. Krizanc, A. Pelc, and M. Vargas Martin. Enhancing hyperlink structure for improving web performance. Journal of Web Engineering, 1(2):93-127, March 2003. [4] M. C. Drott. Using web server logs to improve site design. In Proceedings of ACM Conference of Computer Documentation, pages 43-50, 1998. [5] Sven Fuhrmann, Sven Oliver Krumke, and HansChristoph Wirth. Multiple hotlink assignment. In Proceedings of the Twenty-Seventh International Workshop on Graph- Theoretic Concepts in Computer Science, 2001. [6] Ori Gerstel, Shay Kutten, Rachel Matichin, and David Peleg. Hotlink enhancement algorithms for web direc-
tones. In Proceedings of the ISAAC'2003, December 2003. [7] Steve Classman. A caching relay for the world wide web. In First International World-Wide Web Conference, pages 69-76, May 1994. [8] Evangelos Kranakis, Danny Krizanc, and Sunil Shende. Approximate hotlink assignment. In International Symposium on Algorithms and Computation, pages 756-767, 2001. [9] DASH Optimization. XPRESS-MP software. www.dashoptimization.com. [10] Mike Perkowitz and Oren Etzioni. Adaptive web sites: an AI challenge. In IJCAI (1), pages 16-23, 1997. [11] Mike Perkowitz and Oren Etzioni. Towards adaptive Web sites: conceptual framework and case study. Computer Networks (Amsterdam, Netherlands: 1999), 31(11-16):1245-1258, 1999. [12] Artur A. Pessoa, Eduardo Sany Laber, and Criston de Souza. On the worst case search in trees with hotlinks. Technical Report 23, Departamento de Informatica, PUC-RJ, Rio de Janeiro, Brasil, August 2003.
87
Experimental Comparison of Shortest Path Approaches for Timetable Information* Evangelia Pyrga*
Prank Schulz*
Abstract We consider two approaches that model timetable information in public transportation systems as shortestpath problems in weighted graphs. In the time-expanded approach every event at a station, e.g., the departure of a train, is modeled as a node in the graph, while in the time-dependent approach the graph contains only one node per station. Both approaches have been recently considered for the earliest arrival problem, but little is known about their relative performance. So far, there are only theoretical arguments in favor of the time-dependent approach. La this paper, we provide an extensive experimental comparison of the two approaches. Using several real-world data-sets we evaluate the performance of the basic models and of several extensions towards realistic modeling. Furthermore, new insights on solving bicriteria problems in both models are presented. The time-expanded approach turns out to be more robust for modeling more complex scenarios, whereas the time-dependent approach shows a clearly better performance.
1 Introduction An important problem in public transportation systems is to model timetable information so that subsequent queries asking for optimal itineraries can be efficiently answered. The main target that underlies the modeling (and which applies not only to public transportation systems, but also to other systems as well like route planning for car traffic, database queries, web searching, etc) is to process a vast number of on-line queries as fast as possible. In this paper, we are concerned with 'This work was partially supported by the 1ST Programme of EU under contract no. IST-1999-14186 (ALCOM-FT), by the Human Potential Programme of EU under contract no. HPRN-CT1999-00104 (AMORE), and by the DFG under grant WA 654/112. tComputer Technology Institute, P.O. Box 1122, 26110 Patras, Greece, and Department of Computer Engineering and Informatics, University of Patras, 26500 Patras, Greece. Emails: {pirga,zaro}<9ceid.upatras.gr. * University of Karlsruhe, Department of Computer Science, P.O. Box 6980, 76128 Karlsruhe, Germany. Emails: {fschulz,dwagner}9ira.uka.de.
88
Dorothea Wagner*
Christos Zaroliagis*
a specific, query-intensive scenario arising in public railway transport, where a central server is directly accessible to any customer either through terminals hi train stations or through a web interface, and has to answer a potentially infinite number of queries. The main goal in such an application is to reduce the average response time for a query. Two main approaches have been proposed for modeling timetable information: the time-expanded [5, 9, 11, 12], and the time-dependent approach [1, 6, 7, 8]. The common characteristic of both approaches is that a query is answered by applying some shortest path algorithm to a suitably constructed digraph. The timeexpanded approach [11] constructs the time-expanded digraph in which every node corresponds to a specific time event (departure or arrival) at a station and edges between nodes represent either elementary connections between the two events (i.e., served by a train that does not stop in-between), or waiting within a station. Depending on the problem that we want to solve (see below), the construction assigns specific fixed weights to the edges. This naturally results in the construction of a very large (but usually sparse) graph. The tunedependent approach [1] constructs the tune-dependent digraph hi which every node represents a station and two nodes are connected by an edge if the corresponding stations are connected by an elementary connection. The weights on the edges are assigned "on-the-fly", i.e., the weight of an edge depends on the time in which the particular edge will be used by the shortest path algorithm to answer the query. The two most frequently encountered timetable problems are the earliest arrival and the minimum number of transfers problems. In the earliest arrival problem, the goal is to find a train connection from a departure station A to an arrival station B that departs at A later than a given departure time and arrives at B as early as possible. There are two variants of the problem depending on whether train transfers within a station are assumed to take negligible time (simplified version) or not. In the minimum number of transfers problem, the goal is to find a connection that minimizes the number of train transfers when
considering an itinerary from A to B. We consider also combinations of the above problems as bicriteria problems. Techniques for solving general multi-criteria problems have been discussed in [4, 5], where the discussion in [4] is focused on a distributed approach for timetable information problems. Space consumption aspects of modeling more complex real-world scenarios is considered in [3]. For the time-expanded model, the simplified version of the earliest arrival problem has been extensively studied [11, 12], and an extension of the model able to solve the'niinimum number of transfers problem, but without transfer times, is discussed hi [5]. For the time-dependent model, several extensions to that model are proposed in [10] including transfer times and the miTiiTnnm number of transfers problem. Comparing the time-expanded and time-dependent approach, it is argued theoretically in [1] that the tune-dependent approach is better than the time-expanded one when the simplified version of the earliest arrival problem is considered. In this paper, we provide the first experimental comparison of the time-expanded and the tunedependent approaches with respect to their performance in the specific, query-intensive scenario mentioned earlier. For the simplified earliest arrival problem we show that the time-dependent approach is clearly superior to the time-expanded approach. In order to cope with more realistic requirements, we investigate, besides the extensions to train transfers in combination with the earliest arrival problem proposed in [5, 10], additional new extensions of both approaches. In particular, the proposed extensions can handle cases not tackled by most previous studies for the sake of simplification. These new cases are: (a) the waiving of the assumption that transfer of trains within a station takes negligible time; (b) the consideration of the minimum number of transfers problem; (c) the involvement of traffic days; and (d) the consideration of bicriteria problems combining the earliest arrival and the minimum number of transfers problems. We also conducted extensive experiments comparing the extended approaches. That comparison is important, since the described extensions are mandatory for real-world applications, and (to the best of our knowledge) nothing is known about the relative behavior of realistic versions of the two approaches. In Section 2 the variants of itinerary problems that are considered in this paper are defined. The modeling of the earliest-arrival problem is considered hi Section 3, where first the basic ideas of the time-expanded and time-dependent models are briefly reviewed and then the realistic extensions of these approaches are pre-
sented. Sections 4 and 5 discuss how the minimum number of transfers problem and the bicriteria problems, resp., can be solved hi either of the extended models. The experimental comparison of the two approaches based on real data from the German railways is presented in Section 6. We first consider how the plain versions of the two approaches compare, and subsequently investigate the extensions and bicriteria problems. Section 7 summarizes our insights on the advantages and disadvantages of the approaches under comparison. 2 Itinerary Problems In this section, we provide definitions of the timetable problems that we will consider. A timetable consists of data concerning: stations (or bus stops, ports, etc.), trains (or busses, ferries, etc.) connecting stations, departure and arrival times of trains at stations, and traffic days. We define an elementary connection c to be a 5-tuple of the form c = (Z,Si,S2,*d,ta) a^d interpret it as train Z leaves station S\ at tune td, and the immediately next stop of train Z is station £2 at time ta. The tune values ta and td are integers in the interval [0,1439] representing the tune in minutes past midnight. The length of elementary connection c, denoted by length(c), is ta — td (mod 1440). We generally assume that trains are operated daily, unless stated otherwise, as e.g., hi Section 3.3. There, we discuss the integration of traffic days: for each elementary connection we are given additionally one bit per day indicating whether that particular connection is operated on that day. If x denotes a tuple's field, then the notation x(c) specifies the value of x hi the elementary connection c. The timetable induces a set C of elementary connections. At a station £ it is possible to transfer from one train to another. Such a transfer is only possible if the time between the arrival and the departure at that station S is larger than or equal to a given, station-specific, transfer time, denoted by transfer (S). A sequence of elementary connections P = (ci,...,Cfe) together with departure times depi(P) and arrival tunes arr»(P), 1 < i < &, is called a connection from station A = 3i(ci) to station B = S^Cfc), if it fulfills some consistency conditions: the departure station of G£+I is the arrival station of c,-; the tune values depi(P) and arrj(P) correspond to the tune values td and ta, resp., of the elementary connections (modulo 1440) and respect the transfer times at stations. We also assume that the times depi(P) and arr,(P) include data regarding the departure/arrival day by counting time in minutes from the first day of the timetable. Such a time t is of the form t = a • 1440 + &, where a 6 [0,364] and b e [0,1439]. Hence, the actual time within a day is t (mod 1440) and the actual day is |t/1440j.
89
For the timetable information problem we are additionally given a large, on-line sequence of queries. A query defines a set of valid connections, and an optimization criterion (or criteria) on that set of connections. The problem is to find the optimal connection (or a set of optimal connections) w.r.t. the specific criterion or criteria. In this work, we are concerned with two of the most important criteria, namely the earliest arrival (EA) and the 1™*"*""*" number of transfers (MNT), and consequently investigate two single-criterion and a few bicriteria optimization problems which are defined next. Earliest Arrival Problem (EAP). A query (A, B, to) consists of a departure station A, an arrival station B, and a departure tune to (including the departure day). Connections are valid if they depart at least at the given departure tune to, and the optimization criterion is to minimize the difference between the arrival time and the given departure time. We distinguish between two different variants of the problem: (a) The simplified version, where train transfers take negligible tune and hence the input is restricted to transfer(S) — 0 for all stations S. (b) The realistic version where train transfers require arbitrary nonnegative minimum transfer tunes transfer(S). We will discuss efficient solutions to these problems in Section 3. Minimum Number of Transfers Problem (MNTP). A query consists only of a departure station A and an arrival station B. Trains are assumed to be operated daily, and there is no restriction on the number of days a timetable is valid1. All connections from A to B are valid, and the optimization criterion is to rmnimize the number of train transfers. We will discuss this problem in Section 4. Bicriteria Problems. We consider also bicriteria or Pareto-optimal problems with the earliest arrival (EA) and the imniTnwn number of transfers (MNT) as the two criteria. We are interested in two problem variants: (i) finding the so-called Pareto-curve which is the set of undominated Pareto-optimal paths (the set of feasible solutions where the attribute-vector of one solution is not dominated by the attribute-vector of another solution), and (ii) finding the lexicographically first Paretooptimal solution (e.g., find among all connections that minimize EA the one with minimum number of transfers). We will discuss these problems in detail in Section 5.
3 Earliest Arrival Problem

In this section we consider the modeling of the EAP in both the time-expanded and the time-dependent approach. In either approach we first briefly describe how to model the simplified version of the problem, where transfers between trains at a station take negligible time, and subsequently consider the realistic version of EAP, where the transfer time between trains at a station is non-zero.

3.1 Time-Expanded Model

3.1.1 Simplified Version The time-expanded model [11] is based on the time-expanded digraph, which is constructed as follows. There is a node for every time event (departure or arrival) at a station, and there are two types of edges. For every elementary connection (Z, S1, S2, td, ta) in the timetable, there is a train-edge in the graph connecting a departure node, belonging to station S1 and associated with time td, with an arrival node, belonging to station S2 and associated with time ta. In other words, the endpoints of the train-edges induce the set of nodes of the graph. For each station S, all nodes belonging to S are ordered according to their time values. Let v1, ..., vk be the nodes of S in that order. Then, there is a set of stay-edges (vi, vi+1), 1 <= i <= k-1, and (vk, v1), connecting the time events within a station and representing waiting within that station. The edge length of an edge (u, v) is tv - tu (mod 1440), where tu and tv are the time values associated with u and v, respectively. It is easy to see that the simplified version of EAP can be solved by computing a shortest path from the first departure node at the departure station with departure time later than or equal to the given start time. Since edge lengths are non-negative, one can use Dijkstra's algorithm and abort the main loop when a node at the destination station is reached.
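The following C++ fragment sketches this search on the simplified time-expanded graph; it is only an illustration under assumed data structures (an adjacency list with precomputed modular edge lengths and a station id per event node), not the implementation used in the experiments.

#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Arc { int target; int length; };                   // length = (t_v - t_u) mod 1440

struct TimeExpandedGraph {
    std::vector<std::vector<Arc>> out;                    // adjacency list over event nodes
    std::vector<int> station_of;                          // station id of each event node
};

// Plain Dijkstra, aborted as soon as a node of the destination station is settled.
// Returns the travel time from the chosen start event to the destination station.
long long earliest_arrival(const TimeExpandedGraph& g, int start_node, int dest_station) {
    const long long INF = std::numeric_limits<long long>::max();
    std::vector<long long> dist(g.out.size(), INF);
    using Entry = std::pair<long long, int>;              // (distance, node)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    dist[start_node] = 0;
    pq.push({0, start_node});
    while (!pq.empty()) {
        auto [d, u] = pq.top(); pq.pop();
        if (d > dist[u]) continue;                        // stale queue entry
        if (g.station_of[u] == dest_station) return d;    // abort the main loop here
        for (const Arc& a : g.out[u])
            if (d + a.length < dist[a.target]) {
                dist[a.target] = d + a.length;
                pq.push({dist[a.target], a.target});
            }
    }
    return INF;                                           // destination not reachable
}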
3.1.2 Realistic Version In this case we keep, for each station, an additional copy of all departure nodes in the station, which we call transfer nodes; see Fig. 1. The stay-edges are now introduced between the transfer nodes. For every arrival node there are two additional outgoing edges: one edge to the departure node of the same train, and a second edge, called a transfer edge, to the transfer node with time value greater than or equal to the time of the arrival node plus the minimum time needed to change trains at the given station. The edge lengths are defined as in the original model (see Section 3.1.1).

¹This assumption can be safely made since time is not minimized in the MNTP, and thus in an MNTP-optimal connection one can wait arbitrarily long at a station for some connection that is valid only on certain days.
Figure 1: Modeling train transfers in the time-expanded approach.

Figure 2: Modeling train transfers in the time-dependent approach.
3.2 Time-Dependent Model

3.2.1 Simplified Version The time-dependent model [1] is also based on a digraph, called the time-dependent graph. In this graph there is only one node per station, and there is an edge e from station A to station B if there is an elementary connection from A to B. The set of elementary connections from A to B is denoted by C(e). The cost of an edge e = (v,w) depends on the time at which this particular edge is used by an algorithm which solves EAP. In other words, if T is a set denoting time, then the cost of an edge (v,w) is given by f_(v,w)(t) - t, where t is the departure time at v and f_(v,w) : T -> T is a function such that f_(v,w)(t) = t', where t' >= t is the earliest possible arrival time at w. A modification of Dijkstra's algorithm can be used to solve the earliest arrival problem in the time-dependent model. Let D denote the departure station and t0 the earliest departure time. The differences w.r.t. Dijkstra's algorithm are: set the distance label of the starting node corresponding to the departure station D to t0 (and not to 0), and calculate the edge lengths by evaluating the functions f_e on the fly. Assume that the edge e = (A, B) is considered, and let the earliest arrival time at station A be t. We compute f_e(t) by determining the earliest connection c* in C(e) departing from A later than t. Then, the earliest arrival at B via A is the arrival time of c*. In other words, the length of e is the waiting time at A for c* plus length(c*). The particular connection c* can easily be found by binary search if the elementary connections C(e) are maintained in a sorted array. See [1] for more details on the algorithm and its correctness.
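As an illustration of this on-the-fly edge evaluation, the sketch below looks up c* by binary search over the departure times of C(e), kept in sorted order; the data layout (parallel vectors of departure and arrival minutes, assumed non-empty) is an assumption made for the example, not the paper's actual data structure.

#include <algorithm>
#include <vector>

struct EdgeConnections {                 // elementary connections of one edge e = (A,B)
    std::vector<int> dep;                // sorted departure minutes at A (0..1439)
    std::vector<int> arr;                // corresponding arrival minutes at B
};

// Evaluate f_e(t): earliest arrival at B when reaching A at absolute time t
// (minutes counted from the first day of the timetable).
long long evaluate_edge(const EdgeConnections& e, long long t) {
    const int DAY = 1440;
    int tod = static_cast<int>(t % DAY);                  // time of day on arrival at A
    // first departure at or after tod; if there is none, wrap around to the next day
    auto it = std::lower_bound(e.dep.begin(), e.dep.end(), tod);
    if (it == e.dep.end()) it = e.dep.begin();            // assumes C(e) is non-empty
    int d = *it;
    int a = e.arr[it - e.dep.begin()];
    long long wait = (d - tod + DAY) % DAY;               // waiting time at A for c*
    long long travel = (a - d + DAY) % DAY;               // length(c*)
    return t + wait + travel;                             // f_e(t): arrival time at B
}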
3.2.2 Realistic Version To model non-zero train transfers in the time-dependent model, we use information on the routes that trains may follow, as proposed in [10]. In the following, we describe the construction of a digraph G = (V, E) which will be our main model and will be referred to as the train-route digraph; see Fig. 2. We say that stations A0, A1, ..., Ak-1, k > 0, form a train route if there is some train starting its journey from A0 and visiting A1, ..., Ak-1 in turn. If there are several trains following the same schedule (with respect to the order in which they visit the above stations), then we say that they all belong to the same train route P. In the train-route digraph there are several nodes per station A: one node A representing the station itself, and for each train route visiting A an additional route node p_A. There are three kinds of edges: (i) get-in edges from A to p_A with constant length transfer(A); (ii) get-off edges from p_A to A with zero edge length; and (iii) route edges from p_A to p_B (where B is the next station in the train route) with time-dependent edge length. A route edge (p_A, p_B) contains those elementary connections from A to B that belong only to the considered train route. Get-in edges belonging to the departure station have zero edge length.

3.3 Incorporating Traffic Days Integrating traffic days into any of the models and algorithms described so far can be done as follows. Whenever an elementary connection is considered, the real departure time is known, not only modulo a day, and the day can be determined by dividing the calculated departure time by 1440 (see Section 2). A look-up in the traffic-day table of the corresponding train shows whether the elementary connection is valid on that day. Elementary connections that are not valid can be ignored.
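A traffic-day look-up of this kind can be sketched as follows; the bitset-per-train representation and the 365-day period are assumptions made for illustration only.

#include <bitset>
#include <cstddef>

// One validity bit per day of the timetable period (here assumed to span 365 days).
using TrafficDays = std::bitset<365>;

// An elementary connection is usable at absolute departure time t (in minutes)
// only if its train operates on day floor(t / 1440).
bool valid_on_day(const TrafficDays& days_of_train, long long departure_time) {
    long long day = departure_time / 1440;
    return day >= 0 && day < 365 && days_of_train.test(static_cast<std::size_t>(day));
}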
4 The Minimum Number of Transfers Problem

The graphs defined for the realistic version of the earliest arrival problem in both the time-expanded (Section 3.1.2) and the time-dependent (Section 3.2.2) approach can be used to solve the MNT problem with a similar method. Edges that model transfers are assigned a weight of one, and all other edges are assigned weight zero. In both approaches, a shortest path in the resulting (always static) weighted digraph yields a connection with a minimum number of transfers. In the time-expanded case the transfer edges are the edges with weight one, and a shortest path from an arbitrary transfer node of the source station to an arrival node of the destination station yields a solution of the MNTP. In the time-dependent case the get-in edges, except the get-in edges belonging to the departure station, are assigned weight one; all other edges have weight zero. Here, the MNTP is solved by a shortest path from the node representing the departure station to the one representing the arrival station.

5 Bicriteria Problems

We consider bicriteria problems with the earliest arrival (EA) and the minimum number of transfers (MNT) as the two criteria. We investigate two problem variants: on the one hand we want to find all Pareto-optimal solutions, and on the other hand we want to find the lexicographically first Pareto-optimal solution (e.g., find among all connections that minimize EA the one with minimum number of transfers). In the following, we shall refer to the bicriteria problem we consider as (X,Y), with X (resp. Y) as the first (resp. second) criterion we want to optimize and X,Y in {EA,MNT}. Again, the graphs defined for the realistic EAP described in Sections 3.1.2 and 3.2.2 are used.

5.1 Time-Expanded Model
5.1.1 Lexicographically First Pareto-optima We first consider the (EA,MNT) case. We maintain a second edge weight, the transfer value trans(e) for an edge e = (u,v), whose value is 1 if e is a transfer edge (i.e., u is an arrival node and v a transfer node), and 0 otherwise. Consider now the edge weights as pairs of travel time and trans(e), and define the canonical addition on these pairs: (a,b) + (a',b') = (a+a', b+b'). The smaller-than relation is the lexicographical extension to pairs: (a,b) < (a',b') iff (a < a') or (a = a' and b < b'). To find the lexicographically first Pareto-optimal solution, it then suffices to run Dijkstra's algorithm by maintaining distance labels as pairs of integers and by initializing the distance label of the start node s to d(s) = (0,0). The optimal solution is found when a node at the destination station is considered for the first time during the execution of the algorithm. The (MNT,EA) case is symmetric to the above and can be solved similarly. Note that in the same way the latest-departure problem can be solved by minimizing the difference between arrival time and actual departure time as the second criterion.

5.1.2 All Pareto-optima Finding all Pareto-optimal solutions is generally a hard problem, since there can be an exponential number of them. However, if we make the (apparently reasonable) assumption that connections arriving more than one day later than the earliest arriving connection are not of interest, then every node in the time-expanded graph can have only one Pareto-optimum. Hence, the above-described method for producing the lexicographically first Pareto-optimum can provide all Pareto-optima of a station, if one simply lets the algorithm run until all nodes of the destination station have been considered (either settled or disregarded as dominated solutions).

5.2 Time-Dependent Model

5.2.1 Lexicographically First Pareto-optima The lexicographically first Pareto-optimum in the (MNT,EA) case can be computed in the time-dependent model just as in the time-expanded model (see Section 5.1.1), by defining edge weights as pairs of transfers and travel time (see also [10]). The (EA,MNT) case cannot be solved by that method, which can easily be shown by the construction of a counterexample (see Appendix B).

5.2.2 All Pareto-optima For generating all Pareto-optimal solutions in the time-dependent model we use as a sub-procedure the computation of an earliest arriving connection with a bounded number of transfers (see Appendix A). Then, the following approach can generate all Pareto-optimal solutions. Solve EAP and count the number of transfers of the connection found, say M. Then, run the algorithm described in the Appendix for all values M-1, M-2, ..., 0. The algorithm given in the Appendix actually does this and in fact can speed up this process: instead of stopping when the optimal solution with at most M transfers at the destination is found, we can simply continue with the execution of the algorithm to produce the next EA solution with at most M-1 transfers, and so on, until no new path can be found.
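The pair arithmetic of Section 5.1.1 amounts to replacing the scalar distance type in Dijkstra's algorithm by a pair with component-wise addition and lexicographic comparison; a minimal sketch (with illustrative names, not the paper's code) is:

// Label = (travel time, number of transfers). Running Dijkstra with Label as the
// distance type and start label (0,0) yields the lexicographically first (EA,MNT)
// Pareto-optimum in the time-expanded graph.
struct Label {
    long long time;   // travel time (first criterion)
    int transfers;    // number of transfers (second criterion)
};

inline Label operator+(const Label& a, const Label& b) {
    return {a.time + b.time, a.transfers + b.transfers};
}

// Lexicographic comparison: minimize travel time first, break ties by transfers.
inline bool operator<(const Label& a, const Label& b) {
    return a.time != b.time ? a.time < b.time : a.transfers < b.transfers;
}

// Weight of an edge e: its length paired with trans(e) (1 for transfer edges).
inline Label edge_weight(long long length, bool is_transfer_edge) {
    return {length, is_transfer_edge ? 1 : 0};
}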
                Timetable   Nodes      Edges      El.c./Node   El.c./Edge
Expanded        France       166085     332170    1            0.5
                G-long       480173     960346    1            0.5
                G-local1     691541    1383082    1            0.5
                G-local2    1124824    2249648    1            0.5
                G-all       2295930    4591860    1            0.5
Dependent       France         4578      14791    36           11
                G-long         6817      18812    70           26
                G-local1      13460      37315    51           19
                G-local2      13073      36621    86           31
                G-all         32253      92507    71           25
Table 1: Parameters of the graphs considered in the comparison of the original models. The last two columns show the number of elementary connections per node and per edge.
6 Experiments

The main goal of the experimental study is to compare the performance of the time-expanded and the time-dependent approach. Thus, given two different implementations and a timetable, we define the relative performance or speed-up with respect to a measured performance parameter as the ratio of the value obtained by the first implementation and the value obtained by the second one. When one time-expanded and one time-dependent implementation are compared, we always divide the time-expanded value by the time-dependent value, i.e., we consider the speed-up achieved when the time-dependent approach is used instead of the time-expanded approach. All code is written in C++ and compiled with the GNU C++ compiler version 3.2; the experiments were run on a PC with an AMD Athlon XP 1500+ processor at 1.3 GHz and 512 MB of memory running Linux (kernel version 2.4.19). The implementation of the time-dependent model for the simplified earliest arrival problem uses the parameterized graph data structure of LEDA version 4.4.
6.1 Comparison of Original Models First, we consider the simplified version of the earliest arrival problem, since both approaches were originally developed for that problem and we are interested in investigating their differences in exactly this setting.

6.1.1 Data The following five railway timetables were used. The first timetable contains French long-distance traffic (France) from the winter period 1996/97. The remaining four are German timetables from the winter period 2000/01: one resembles the long-distance traffic in Germany (G-long), two contain local traffic in Berlin/Brandenburg (G-local1) and in the Rhein/Main region (G-local2), and the last is the union of all three German timetables (G-all). Hafas [2], the commercial timetable information system used by the German railway company Deutsche Bahn, is based on data in the same format. Table 1 shows the characteristics of the graphs used in these models for the above-mentioned timetables. Real-world queries were available only for the timetables G-long and G-all, so we additionally generated random queries for every timetable. Each set of queries consists of 50,000 queries of the form departure station, destination station, and earliest departure time. In the tables, real queries are specially marked (X).

6.1.2 Heuristics On top of the models described in Section 3, we considered heuristics to reduce the running time while still guaranteeing optimal solutions. For both approaches we considered the goal-directed search heuristic (see, e.g., [11]). In this heuristic the length of every edge is modified in such a way that if the edge points towards the destination its length gets smaller, while if the edge points away from the destination node its length gets larger. More precisely, for an edge (u,v) with length l(u,v), its new length l'(u,v) becomes l'(u,v) = l(u,v) - p[u] + p[v], where p[.] is a potential function associated with the nodes of the graph. The crucial fact is that p[.] must be chosen in such a way that l'(u,v) is non-negative. For example, a valid potential of a node can be defined by dividing the Euclidean distance to the destination by the highest speed of a train in the timetable (i.e., the time that the fastest train would need on the direct line towards the destination). Concerning the time-expanded model, we reduced the node set: all arrival nodes which have outdegree one can be removed by redirecting incoming edges to the target node of the outgoing edge. Thus, in the time-expanded graphs, the number of nodes equals the number of elementary connections, and the number of edges is twice the number of nodes. In the time-dependent model the binary search technique to determine the edge length (see Section 3.2.1) can be replaced by the method described in [1], which avoids the binary searches.
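A minimal sketch of this edge-length modification, assuming planar coordinates per node and a global maximum train speed, might look as follows (all names are illustrative, not the implementation used in the experiments):

#include <cmath>

struct Point { double x, y; };            // node coordinates (e.g., in kilometres)

// Lower bound on the remaining travel time: straight-line distance to the
// destination divided by the speed of the fastest train in the timetable.
double potential(const Point& v, const Point& target, double max_speed) {
    double dx = v.x - target.x, dy = v.y - target.y;
    return std::sqrt(dx * dx + dy * dy) / max_speed;
}

// Reduced edge length used by goal-directed search: l'(u,v) = l(u,v) - p[u] + p[v].
// With the potential above, l'(u,v) stays non-negative, so Dijkstra remains correct.
double reduced_length(double l_uv, double p_u, double p_v) {
    return l_uv - p_u + p_v;
}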
                Timetable   Time      El.C.    Nodes     Edges
Expanded        France       100.4    30824     33391     61649
                G-long       169.6    44334     48094     88668
                G-local1     608.7   176720    182717    353443
                G-local2     840.1   226027    232511    452056
                G-all       1352.8   326186    342917    652378
                G-long X      66.7    18891     20853     37783
                G-all X      392.1    96943    104369    193888
Dependent       France         8.2     8539      2269      4463
                G-long        10.7    20066      3396      5129
                G-local1      19.7    26792      6535      9835
                G-local2      20.7    31698      6524     10075
                G-all         76.6    79981     16145     26333
                G-long X       5.5    11173      1711      2682
                G-all X       37.3    40808      6926     11647
Table 2: Average CPU-time in ms and operation counts for solving a single query in the time-expanded (upper part) and the time-dependent (lower part) model. The arrival nodes are omitted in the time-expanded model (see Section 6.1.2), and in the time-dependent model binary search is used. Goal-directed search is not applied in either case. The marker (X) indicates that real-world rather than random queries have been used.
Figure 3: Comparison of the performance (CPU-time in ms) of the basic time-expanded and time-dependent implementations (see Table 2; only random queries are shown). The abscissa shows the size of the timetable in number of elementary connections.
6.1.3 Implementation Environment and Performance Parameters For the time-expanded model the implementation is based on that used in [11]; the optimization technique to ignore the arrival events described in Section 6.1.2 is included. For the time-dependent model, we have implemented both the plain version that uses binary search and the "avoid binary search" technique. For both models we also used the goal-directed search heuristic. Thus, for the time-expanded model we have two different implementations (goal-directed search with Euclidean distances or not), while for the time-dependent model we have several implementations depending on the use of binary search or the "avoid binary search" version, the goal-directed search heuristic with Euclidean or Manhattan distances, and whether floating-point or integral potentials are used. For each possible combination of timetable and implementation variant we performed the corresponding set of random queries (for G-long and G-all we additionally performed the corresponding real-world queries) and measured the following performance parameters as mean values over the set of performed queries: CPU-time in milliseconds, number of nodes, number of edges, and number of elementary connections touched by the algorithm. For the time-expanded model, the number of elementary connections touched is the number of train-edges touched by the algorithm, while for the time-dependent model it is the total number of elementary connections that have been used for calculating the edge lengths. More precisely, when binary search is used in the time-dependent model, the number of touched elementary connections for a single edge is the number of steps needed by the binary search.

Figure 4: Like Figure 3, with the difference that here the ordinate shows not the runtime but the speed-up with respect to the number of touched edges, CPU-time, and number of touched elementary connections.

6.1.4 Results and Discussion Figures 3 and 4, as well as Tables 2, 3 and 4, clearly show that the time-dependent model solves the simplified earliest arrival problem considerably faster than the time-expanded model, for every considered data set. Thus, the much smaller graph in the time-dependent approach pays off, and the edge lengths can be computed efficiently enough when real data is considered. Regarding CPU-time, the speed-up ranges between 12 (France) and 40 (G-local2) when the basic implementations are used (see Fig. 4 and Table 2), and between 17 (France) and 57 (G-local2) when the comparison concerns the best implementations (including heuristics) in both models (see Tables 3 and 4). Concerning the time-dependent model, we observe that it is better to use the "avoid binary search" technique (see Tables 3 and 4). Compared to the binary search implementation the speed-up was between 1.39 (G-local1) and 1.86 (G-all with real-world queries). The goal-directed search technique always reduces the search space of Dijkstra's algorithm, i.e., the number of touched nodes and edges. However, this reduction paid off only in a few cases, in the sense that it did not always decrease the CPU-time as well.
Time-Dependent Model, Binary Search

Goal-directed search      Timetable   Time    El. conn.   Nodes    Edges
(a) Euclidean, integer    France        9.4      7072      1593     3415
                          G-long       13.5     16597      2737     4217
                          G-local1     28.6     26008      6257     9434
                          G-local2     30.4     31196      6398     9895
                          G-all       100.3     74525     14568    24030
                          G-long X      6.3      8349      1238     1991
                          G-all X      43.1     33676      5551     9420
(b) Euclidean, float      France        9.4      7062      1590     3410
                          G-long       13.6     16560      2730     4208
                          G-local1     28.9     25975      6249     9422
                          G-local2     30.7     31152      6389     9882
                          G-all       103.6     74394     14538    23983
                          G-long X      6.4      8318      1233     1984
                          G-all X      44.5     33565      5532     9388
(c) Manhattan, integer    France        7.9      7225      1647     3511
                          G-long       11.2     16975      2807     4316
                          G-local1     23.2     26086      6284     9473
                          G-local2     24.7     31235      6407     9908
                          G-all        86.4     74822     14639    24138
                          G-long X      5.3      8555      1272     2041
                          G-all X      38.0     33994      5615     9524
(d) Manhattan, float      France        7.7      7214      1644     3505
                          G-long       11.1     16938      2800     4306
                          G-local1     23.1     26053      6276     9461
                          G-local2     24.6     31189      6398     9894
                          G-all        88.9     74689     14608    24091
                          G-long X      5.2      8524      1267     2034
                          G-all X      39.0     33880      5594     9491

Table 3: Comparison of the time-dependent implementations that use binary search and four different versions of goal-directed search: Euclidean distance with (a) integer and (b) float potentials, and Manhattan distance with (c) integer and (d) float potentials. Columns are as in Table 2.
Time-Expanded Model (goal-directed search)
  Timetable   Time      El. conn.   Nodes     Edges
  France       84.0       22259      24179     44517
  G-long      175.0       34259      37453     68517
  G-local1    684.3      170369     176243    340741
  G-local2    953.0      219992     226386    439986
  G-all      1392.6      285440     300788    570885
  G-long X     54.3       13384      14931     26768
  G-all X     341.9       74069      80229    148140

Time-Dependent Model, Avoid Binary Search
  Variant                   Timetable   Time    El. conn.   Nodes    Edges
  no goal-directed search   France        5.9      8942      2262     4386
                            G-long        7.5      9216      3396     5129
                            G-local1     14.2     18312      6541     9814
                            G-local2     14.6     18435      6524    10075
                            G-all        47.4     48520     16146    26333
                            G-long X      3.8      4773      1711     2682
                            G-all X      20.1     20993      6927    11648
  Euclidean, integer        France        6.6      6711      1614     3406
                            G-long        9.2      7553      2737     4217
                            G-local1     20.5     17656      6301     9499
                            G-local2     21.5     18088      6398     9895
                            G-all        63.0     44010     14568    24030
                            G-long X      4.2      3527      1238     1991
                            G-all X      23.7     16898      5552     9421
  Manhattan, integer        France        5.1      6926      1669     3505
                            G-long        7.1      7733      2807     4316
                            G-local1     15.3     17621      6289     9480
                            G-local2     16.1     18113      6407     9908
                            G-all        50.5     44214     14639    24138
                            G-long X      3.2      3618      1272     2041
                            G-all X      19.1     17088      5615     9524

Table 4: Comparison of goal-directed search in the time-expanded case (upper part) and of the technique to avoid binary searches in the time-dependent case (lower part). In the time-dependent case two different distance measures for the goal-directed search are reported, the Euclidean and the Manhattan distances with integral potentials, which were the fastest. Columns are as in Table 2.
                               Nodes            Edges
Time-expanded   Simplified    289432           578864
                Realistic     578864          1131164

                               Station   Route     Timetable   Transfer
                               Nodes     Nodes     Edges       Edges
Time-dependent  Simplified     6685      -         17577       -
                Realistic      6685      79784     72779       159568

Table 5: Graph parameters for the realistic models, applied to the same input timetable G-long-1: the number of nodes and edges of the graphs in the time-expanded (upper part) and time-dependent (lower part) approach, compared to the simplified, original models.
Problem        Time    Nodes     Edges
EA-simple X      70     20760     41519
EA X             78     40624     73104
MNT X           125    101731    138417
(EA,MNT) X       82     40628     73123
(MNT,EA) X      161     99061    137075
Pareto X        287    123943    236887
EA-simple       106     34469     61955
EA              122     61159    111301
MNT             212    169299    239841
(EA,MNT)        129     61195    111386
(MNT,EA)        259    163438    234297
Pareto          405    170946    330150
Table 6: Results for the realistic problems using the time-expanded implementations. For comparison, the rows referred to by EA-simple show the results in the simple model using the G-long-1 data set.

6.2 Comparison of Realistic Models We now turn to the comparison of the two approaches when the more realistic problems and models are considered. As input data we use a variant of the G-long timetable, which we get by using only elementary connections that are valid on the first day of the timetable period. We refer to that timetable as G-long-1. We applied the real-world and random queries described in Section 6.1.1. Table 5 shows the parameters of the graphs used in the realistic models compared to the original models.

6.2.1 Implementation Environment For both approaches we implemented the described solutions for the realistic earliest arrival problem (Section 3), the minimum number of transfers problem (Section 4), and the all-Pareto-optima problem involving both of the former problems (Section 5). Additionally, we have more efficient implementations in the time-expanded case for the lexicographically first Pareto-optimum when the arrival time is the first criterion (EA,MNT), and in the time-dependent case when the number of transfers is the first criterion (MNT,EA). In the time-expanded implementations we reduced the node set using a similar method as in the simplified case (see Section 6.1.2), and omitted the departure nodes in the time-expanded graph. Also for the realistic time-dependent implementations we applied heuristics similar to the method that avoids the binary search; for details see [10].

6.2.2 Results and Discussion The average values of the number of touched nodes and edges and the average running time for solving the real-world queries are displayed in Tables 6 and 7. These results show that, concerning CPU time, the time-dependent approach still performs better than the time-expanded approach in all the cases considered. However, the gap is not as big as for the simplified earliest arrival problem. In fact, for the realistic earliest arrival problem with realistic queries the speed-up is only 1.6 (compared to a speed-up of 12 for the simplified EAP and the data set G-long, see Table 2). In the time-expanded case, the graph used in the realistic EAP has less than twice as many nodes and edges as the graph used in the simplified EAP, and is of very similar structure. Thus, it needs only slightly more time to solve the realistic EAP than to solve the simplified EAP. The lexicographically first (EA,MNT) problem is solved in a very similar way as the realistic EAP, and the CPU-time and operation counts are almost identical. In contrast, for the MNT and the lexicographically first (MNT,EA) problems, as well as for finding all Pareto-optimal solutions, a much bigger part of the graph has to be searched, and thus more CPU-time is needed. In the time-dependent case, because of the additional nodes and edges in the train-route graph, which are many more than the nodes in the simplified time-dependent graph, the realistic earliest arrival problem is solved 5 times slower than the simplified EAP. The MNT problem is solved faster than the realistic EAP, since all edge lengths in the train-route digraph are static. The solution of the (MNT,EA) problem again involves time-dependent edge lengths, and thus is slower than computing the realistic EAP and MNTP. The implementation for all Pareto-optimal solutions uses the computation of the earliest arrival problem with a bounded number of transfers as a sub-procedure, and needs only roughly twice the time as the solution to the lexicographically first (MNT,EA) problem.
Problem        Time [ms]   Nodes    Timetable Edges   Transfer Edges
EA-simple X        10        2967        4365              -
EA X               50       44731       38168            45494
MNT X              38       26680       21558            61615
(MNT,EA) X         83       28272       22901            60462
Pareto X          181       78412       65753            79691
EA-simple          11        3315        4811              -
EA                 54       48200       41011            48942
MNT                47       33455       27235            69411
(MNT,EA)          106       35262       28779            69054
Pareto            219       92378       77610            94904

Table 7: As Table 6, but for the time-dependent implementations.
7 Conclusion
We have discussed time-expanded and time-dependent models for several kinds of single-criterion and bicriteria problems in timetable information. In the time-expanded case, extensions that model more realistic requirements (like modeling train changes) could be integrated in a more or less straightforward way, and the central characteristic of the approach is that a solution to a given optimization problem could be provided by solving a shortest path problem in a static graph, even for finding all Pareto-optimal solutions in the considered bicriteria problem. In the time-dependent case, the central characteristic of having one node per station had to be violated when more complex optimization problems (like the integration of minimum transfer times at stations) are considered, and more sophisticated techniques had to be used in the bicriteria case. Finding the lexicographically first connection when the earliest arrival is the main criterion could not be done directly in the time-dependent model. Nevertheless, all other problems under consideration could also be modeled in the time-dependent approach.

The experimental study showed that the time-dependent approach is clearly superior with respect to the performance of the original models, as speed-up factors in the range from 10 to 40 were observed. Considering the extensions towards realistic models, however, the time-dependent approach still performs better, but the difference is much smaller. The time-expanded approach benefits from the straightforward modeling that allows more direct extensions and simpler implementations.
References

[1] G. S. Brodal and R. Jacob. Time-dependent networks as models to achieve fast exact time-table queries. In Proc. 3rd Workshop on Algorithmic Methods and Models for Optimization of Railways (ATMOS 2003), Electronic Notes in Theoretical Computer Science, volume 92, issue 1, Elsevier, 2003.
[2] http://bahn.hafas.de. Hafas is a trademark of Hacon Ingenieurgesellschaft mbH, Hannover, Germany.
[3] M. Schnee, M. Müller-Hannemann, and K. Weihe. Getting train timetables into the main storage. In Proc. 2nd Workshop on Algorithmic Methods and Models for Optimization of Railways (ATMOS 2002), Electronic Notes in Theoretical Computer Science, volume 66, issue 6, Elsevier, 2002.
[4] R. Möhring. Angewandte Mathematik - insbesondere Informatik, pages 192-220. Vieweg, 1999.
[5] M. Müller-Hannemann and K. Weihe. Pareto shortest paths is often feasible in practice. In Proc. 5th Workshop on Algorithm Engineering (WAE 2001), Springer LNCS, volume 2141, pages 185-198, Springer, 2001.
[6] K. Nachtigall. Time depending shortest-path problems with applications to railway networks. European Journal of Operational Research, volume 83, pages 154-166, 1995.
[7] A. Orda and R. Rom. Shortest-path and minimum-delay algorithms in networks with time-dependent edge-length. Journal of the ACM, volume 37(3), 1990.
[8] A. Orda and R. Rom. Minimum weight paths in time-dependent networks. Networks, volume 21, 1991.
[9] S. Pallottino and M. Grazia Scutellà. Equilibrium and advanced transportation modelling, chapter 11. Kluwer Academic Publishers, 1998.
[10] E. Pyrga, F. Schulz, D. Wagner, and C. Zaroliagis. Towards realistic modeling of time-table information through the time-dependent approach. In Proc. 3rd Workshop on Algorithmic Methods and Models for Optimization of Railways (ATMOS 2003), Electronic Notes in Theoretical Computer Science, volume 92, issue 1, Elsevier, 2003.
[11] F. Schulz, D. Wagner, and K. Weihe. Dijkstra's algorithm on-line: An empirical case study from public railroad transport. ACM Journal of Experimental Algorithmics, volume 5(12), 2000.
[12] F. Schulz, D. Wagner, and C. Zaroliagis. Using multi-level graphs for timetable information in railway systems. In Proc. 4th Workshop on Algorithm Engineering and Experiments (ALENEX 2002), Springer LNCS, volume 2409, pages 43-59, Springer, 2002.
[13] D. Wagner and T. Willhalm. Geometric speed-up techniques for finding shortest paths in large sparse graphs. In Proc. 11th European Symposium on Algorithms (ESA 2003), Springer LNCS, volume 2832, pages 776-787, Springer, 2003.
[14] M. Ziegelmann. Constrained shortest paths and related problems. PhD thesis, Naturwissenschaftlich-Technische Fakultät der Universität des Saarlandes, 2001.
Appendix A
Earliest Arrival with Bounded Number of Transfers in the Time-Dependent Model

Given two stations a and b, and a positive integer k, the Earliest Arrival problem with Bounded number of Transfers (EABT) is defined to be the problem of finding a valid connection from a to b such that the arrival time at b is the earliest possible, subject to the additional constraint that the total number of transfers performed in the path is not greater than k. Since EAP reduces to a shortest path problem, EABT is clearly a resource-constrained shortest path problem. We consider two algorithms for solving the EABT problem. The first one is an adaptation of the method proposed in [1] to our extended time-dependent model (the train-route digraph). The second one is an adaptation of the labeling approach (see, e.g., [14]) for solving resource-constrained shortest paths to our extended time-dependent model.

The idea of [1], cast to the extended time-dependent model, is as follows. Let A denote the get-off edges, D the get-in edges, and R the route edges (cf. Section 3.2.2). We construct a new digraph G' = (V', E') consisting of k+1 levels. Each level contains a copy of the train-route digraph G = (V, A ∪ D ∪ R). For a node u in V, we denote its i-th copy, placed at the i-th level, by u_i, 0 <= i <= k. For each edge (u,v) in A ∪ R, we place in E' the edges (u_i, v_i), for all 0 <= i <= k. For each edge (u,v) in D, we place in E' the edges (u_i, v_{i+1}), for all 0 <= i < k. These edges, which connect consecutive levels, indicate transfers. With the above construction, it is easy to see that a path from some node s_0 (at
the 0-th level) to a node t_l (at the l-th level) represents a path from station(s) to station(t) with l transfers. In other words, the EABT problem can be solved by performing a shortest path computation in G' aiming to find a shortest path from the node s_0 at level 0, where a = station(s), to the first possible u_i at level i, 0 <= i <= k, where u is the node of the train-route graph such that b = station(u).

The adaptation of the labeling approach to our train-route digraph is as follows. We use the modified Dijkstra's algorithm (cf. Section 3.2.2), where now we maintain k+1 (instead of one) labels, and which requires some additional operations to take place as nodes are extracted from the priority queue. Each label is of the form (t_i, i)_u, 0 <= i <= k, representing the currently best time t_i to reach node u by performing exactly i transfers. Let s be the node for which a = station(s). The algorithm works as follows. Initially, we insert into the priority queue the label (t_0, 0)_s. The priority queue is ordered according to time, aiming at computing the earliest arrival path. When we extract a label (t_l, l)_u, we relax the outgoing edges of u considering that u is reached at time t_l and with l transfers. In addition, if (t_l', l') was the last label of u that had been extracted, then we delete from the priority queue all labels of the form (t_m, m)_u for l < m < l', setting l' = k in the case where (t_l, l)_u was the first of the labels of u to have been extracted. In this way, we discard the labels dominated by (t_l, l)_u from the priority queue, since for all such (t_m, m)_u it holds that t_l <= t_m (as (t_l, l)_u was extracted before (t_m, m)_u) and l < m. Clearly, such labels are no longer useful, as (t_l, l)_u corresponds to an s-u path at least as fast as the one suggested by (t_m, m)_u, and with fewer transfers than the latter. Exactly for the same reasons, when we relax an edge (u,v) in E and find a new label (t_{l1}, l1)_v for v, we actually update the label of v only if so far no label of v has been extracted from the priority queue, or if the last label of v that was extracted had a number of transfers greater than l1.

Concerning now the complexity of the labeling algorithm, we need to see that for each node the total number of labels that is scanned in order to find those that are in the priority queue and can safely be deleted is O(k), while the total number of deletions is O(nk), where n = |V|. This is due to the fact that we only check the labels from the last known (by a delete-min operation) number of transfers until the previous one. In this way, each label is checked at most once throughout the execution of the algorithm. Since each edge will be relaxed at most k+1 times, the total number of relaxations will be O(mk), where m = |E|.
We can also see that the total number of labels that are in the priority queue is at most O(nk). Because of this, the time for a delete-min or a delete operation is O(log(nk)). This means that the total time needed for the algorithm is O(mk + nk log(nk) + nk log(nk)) = O(nk log(nk)), which is the same as for the algorithm in [1].
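As an illustration of the first (level-graph) approach, the sketch below keeps the k+1 copies implicit by running Dijkstra over states (node, transfers used). It is a simplified rendering under assumed graph types, with edge costs supplied by a callback; it is not the paper's implementation, and for brevity every get-in edge counts as a transfer (a faithful version would exempt the boarding at the departure station).

#include <functional>
#include <limits>
#include <queue>
#include <tuple>
#include <vector>

struct REdge { int target; bool is_get_in; int id; };    // get-in edges cross one level up

// Earliest arrival with at most k transfers. arrival_if_taken(id, t) is assumed to
// return the arrival time at the edge's target when edge id is entered at time t
// (route edges are time-dependent, get-in/get-off edges add a constant).
long long eabt(const std::vector<std::vector<REdge>>& g, int source, int destination,
               long long t0, int k,
               const std::function<long long(int, long long)>& arrival_if_taken) {
    const long long INF = std::numeric_limits<long long>::max();
    std::vector<std::vector<long long>> best(g.size(), std::vector<long long>(k + 1, INF));
    using State = std::tuple<long long, int, int>;        // (time, node, transfers used)
    std::priority_queue<State, std::vector<State>, std::greater<State>> pq;
    best[source][0] = t0;
    pq.push({t0, source, 0});
    while (!pq.empty()) {
        auto [t, u, used] = pq.top(); pq.pop();
        if (t > best[u][used]) continue;                  // stale entry
        if (u == destination) return t;                   // earliest arrival with <= k transfers
        for (const REdge& e : g[u]) {
            int nu = used + (e.is_get_in ? 1 : 0);
            if (nu > k) continue;                         // would exceed the transfer budget
            long long ta = arrival_if_taken(e.id, t);
            if (ta < best[e.target][nu]) {
                best[e.target][nu] = ta;
                pq.push({ta, e.target, nu});
            }
        }
    }
    return INF;                                           // no connection within k transfers
}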
Appendix B

The (EA,MNT) Problem in the Time-Dependent Approach

The following describes an example showing that the (EA,MNT) problem cannot be solved directly in the time-dependent approach by using pairs as edge costs in the train-route digraph. Consider the train-route digraph shown in Figure 5, and a query to find an A-D connection. Let C1 be an A-B-C-D connection with one transfer at C, and C2 be an A-C-D connection with no transfer. Both connections arrive at the same (optimal) time at D, and connection C1 arrives earlier at C than connection C2. Then, the algorithm for finding the lexicographically first solution using pairs (EA,MNT) as edge costs outputs connection C1 as optimal, while there is a connection C2 with the same arrival time but fewer transfers. The reason is that the time-dependent function for edge e is decreasing.
Figure 5: Lexicographically first (EA,MNT) connections cannot be found by simply using pairs as edge costs in the train-route digraph.
Reach-based Routing: A New Approach to Shortest Path Algorithms Optimized for Road Networks

Ron Gutman*

January 6, 2004

Abstract
Past work has explored two strategies for high volume shortest path searching on large graphs such as road networks. One strategy, extensively researched by the academic community, pre-computes paths and avoids a too expensive all-pairs computation by computing and storing only enough paths that a path for an arbitrary origin and destination can be formed by joining a small number of the pre-computed paths. This approach, in practice, has been unwieldy for large graphs - both preprocessing time and the size of the resulting database are excessively burdensome. The implementations can be unusually complex. The other strategy, often used in industry for routes on road networks, exploits the natural hierarchy in a road network to prune a Dijkstra search. This is much less unwieldy, but less reliable as well. The pruning is based on a heuristic in such a way that no guarantee about the optimality of the computed path can be made. In the worst cases, the algorithm will fail to find a path even though one exists. Both of these strategies are inflexible or inefficient in the face of complex requirements such as changes to the network or queries involving multiple destinations or origins. We introduce a new concept called reach which allows shortest path computation speed on par with the industry approach but computes provably optimum paths as do the theoretical approaches. It is more versatile than both; for example, it easily handles multiple origins and destinations. The versatility makes available a wide range of strategies for dealing with complex routing problems. In a test on a graph of 400,000 vertices, the new algorithm computed paths between randomly chosen origins and destinations 10 times faster than Dijkstra. It also combines naturally with the A* algorithm for an additional reduction in path query processing time.
TaveMarket and Yahoo, e-mail: [email protected]
1 Introduction
Some form of hierarchy is inevitably harnessed to improve the speed of shortest path computations on large graphs. Many of the well-researched strategies impose a hierarchy on the graph, usually by partitioning the graph (see [7, 13, 14, 15]). The partitioning permits a trade-off between pre-computation and query-time processing. Results stored from the preprocessing represent paths from arbitrary vertices in the graph to vertices at the partition boundaries and paths between vertices on the partitions. Queries are serviced by joining stored paths (usually 2 or 3 paths). Some variations of this approach assume planarity of the graph or that the graph is undirected (e.g., [7, 14, 15]). Road networks cannot be reconciled with either assumption. It is common in industry to rely on the natural hierarchy in road networks to improve query speed. This approach involves no preprocessing. Typically, roads are classified according to their importance for longer routes, e.g., freeways are very important and residential streets are unimportant. Either vertices or edges in the graph representation may carry the importance attribute. A modified Dijkstra algorithm disregards vertices or edges of low importance when they are far from both origin and destination. A heuristic rule determines whether a vertex or edge can be ignored. Effective implementations require a "bidirectional" search, that is, two Dijkstra algorithms running in tandem, one searching from the origin toward the destination and the other searching from the destination in reverse, that is, on a graph in which every edge (u,v) has been replaced by (v,u). The heuristic does not guarantee a shortest path, and reliably good results require tuning. Empirically, the performance is sub-linear as a function of the "distance" between the origin and destination. Both approaches suffer from inflexibility. Neither can efficiently address problems that involve multiple destinations and origins; each possible pair of origin and destination must be handled as a separate problem. Neither is easily adapted to a dynamically
changing graph. For example, if a traffic jam increases the weight on the edges representing the involved roads, a frontage road might become very important. How can that change in importance be discovered? Our approach was inspired by the latter question. Instead of relying on road classifications, we define a formal attribute of a vertex that reflects the importance of the vertex and can be computed from the graph. The attribute, which we call "reach", makes possible a new variation on the Dijkstra algorithm ([4, 5]) that preserves the optimality of the result while improving computation time significantly. We show that the modified algorithm preserves optimality. We also offer efficient algorithms for computing the attribute. Our approach offers these advantages:
• Guarantees optimality of computed paths.
• The shortest path computation time is comparable to that of the industry approach.
• Can be combined with other optimizations such as the A* algorithm (see [10, 11]).
• Storage requirements are not significantly increased by the pre-computed data.
• Preprocessing may be fast enough to handle dynamically changing graphs in some applications (e.g., a metropolitan road network on a parallel machine with 10-20 CPUs).
• Greatly reduces computation time of shortest paths for multiple origins or destinations.
We feel that the last of these is the most important because of the importance of multiple origins and destinations in fields such as transportation logistics and the lack of a satisfactory alternative for road networks. The computational effectiveness of our approach depends on properties of the graph. We have not formalized those properties, so we provide empirical results on 3 different data sets instead of theoretical time bounds. Loosely, our approach depends on the presence of a natural hierarchy in the network. We believe that most large networks used for transport will possess some hierarchy due to the motivations of designers to optimize the network for transport. Section 2 of this paper introduces the concept of reach, gives its formal definition, and explains notation and terminology used in this paper. Section 3 presents our shortest path algorithm.
Because the computation of reach for each vertex potentially requires an all-pairs shortest path computation, we present an alternative in sections 4 and 5. The alternative method computes an upper bound on the reach of each vertex at much less cost than an all-pairs computation. An upper bound can be used in our shortest path algorithms without affecting correctness. Section 4 gives an intuitive description of the approach while section 5 presents theorems that lead to the actual algorithm for computing upper bounds on reach. Though we have proofs for all of the theorems, space did not permit their inclusion. Section 6 describes the algorithm for computing the upper bounds and its implementation. Section 7 presents experimental results that characterize the performance of our reach bound computation and shortest path algorithms.
2 The Concept of Reach

Intuitively, the reach of a vertex encodes the lengths of shortest paths on which it lies. To have a high value of reach, a vertex must lie on a shortest path that extends a long distance in both directions from the vertex. The notion of "length" or "distance" here is not necessarily the same as the notion of "weight" or "cost"; that is, the length of a path is not necessarily the same as the weight, or cost, of the path. In a road network, for example, travel time is commonly used as the cost metric, and the weights on the edges reflect travel time, not travel distance. The definition of reach will also depend on some metric. The reach metric might be the same as the cost metric. However, our experimental results showed that, on road networks, using travel distance as the reach metric more effectively captured the hierarchy in the network. In addition, a metric based on geometric distance permits a more straightforward implementation of the shortest path algorithm and one which can be readily adapted to a variety of problems and strategies for solving them. For this reason, we use terminology and notation that distinguishes the weight function of a graph from a reach metric for the graph. The definition of a reach metric is identical to the definition of a weight function: a reach metric for graph G = (V,E) is a function m : E -> R mapping any edge e in G to a real number, m(e). For a path P, we use the notation m(P) to represent the sum of m(e) over all edges e of P, or zero if P has only one vertex. The notation m(u,v,P) represents m(Q), where Q is the subpath of P from u to v. (In this paper, the term "path" always means "simple path".) It's possible for the reach metric to be the same
as the weight function, or cost metric. However, we found that a reach metric based on distance is more effective and has other advantages, so the shortest path algorithm we present uses a reach metric based on distance. We assume that the graph is "projected" into a plane. We use the word "projected" to distinguish this notion from the notion of a planar graph. None of our work assumes planarity of the graph. We only assume that each vertex has been assigned coordinates in some Euclidean space, without requiring that edges do not intersect. We do assume that the assignment of coordinates is consistent with the reach metric in this way: given an edge (u,v), m(u,v) >= d(u,v), where d gives the Euclidean distance between two vertices. The projection into a Euclidean space is not strictly needed as long as there is a distance function consistent with the reach metric, but most networks with a distance function are projected into some space. Note that given a path P from u to v, m(P) = m(u,v,P) >= d(u,v). Because the term "shortest path" too easily brings length or distance to mind, we instead use the term "least-cost path" for the remainder of the paper except for the "Experimental Results" section (as much for our own sanity as the reader's). A "least-cost path tree" is the same as a "shortest path tree".

Definition: Given,
• a directed graph G = (V,E) with positive weights
• a non-negative reach metric m : E -> R
• a path P in G starting at vertex s and ending at vertex t
• a vertex v on path P
then the reach of v on P, r(v,P), is min{m(s,v,P), m(v,t,P)}, and the reach of v in G, r(v,G), is the maximum value of r(v,Q) over all least-cost paths Q in G containing v.
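To make the definition concrete, the reach of every vertex on a single path can be computed from the prefix sums of the reach metric; the following C++ sketch (with assumed input layout) illustrates r(v,P) = min{m(s,v,P), m(v,t,P)}.

#include <algorithm>
#include <cstddef>
#include <vector>

// Reach of each vertex on a path P = v0, v1, ..., vn, given the reach-metric
// values of its edges (m_edge[i] is the metric of edge (v_i, v_{i+1})).
std::vector<double> reach_on_path(const std::vector<double>& m_edge) {
    std::size_t n = m_edge.size();                 // number of edges; n+1 vertices
    std::vector<double> prefix(n + 1, 0.0);        // prefix[i] = m(v0, v_i, P)
    for (std::size_t i = 0; i < n; ++i) prefix[i + 1] = prefix[i] + m_edge[i];
    std::vector<double> reach(n + 1);
    for (std::size_t i = 0; i <= n; ++i)
        reach[i] = std::min(prefix[i], prefix[n] - prefix[i]);   // min of the two sides
    return reach;
}
// r(v, G) is then the maximum of r(v, Q) over all least-cost paths Q containing v.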
3 A Reach-based Shortest Path Algorithm

The algorithm we present here assumes a reach metric that is consistent with the Euclidean distance function in the manner described above. Let G be a directed graph G = (V,E) with positive weights, and reach metric m consistent with distance function d : (V x V) -> R. The algorithm is a modification of Dijkstra's algorithm in which a function, test(v), is called immediately prior to inserting a vertex, v, into the priority queue. If test returns true, the vertex is inserted into the priority queue; otherwise the vertex is not inserted into the priority queue. We describe the algorithm for test(v) and prove that the modified Dijkstra algorithm finds the least-cost path. Our tests show that the algorithm reduces the number of insertions into the priority queue by an order of magnitude or more on moderately long paths (more than 25 km) in road networks. The function test(v) uses the following information:
• r(v,G) as defined above
• m(P), where P is the computed path from the origin, s, to v at the time v is to be inserted into the priority queue
• d(v,t), where t is the destination
The value of test(v) is:
• true if r(v,G) >= m(P) or r(v,G) >= d(v,t)
• false otherwise
In other words, the value returned by test(v) is only false if the reach of v is too small for it to lie on a least-cost path a distance m(P) from the origin and at a straight-line distance d(v,t) from the destination.
The modified Dijkstra algorithm is equivalent to the unmodified Dijkstra algorithm performed on a graph, G', that results from removing the rejected nodes from G. To prove that the algorithm is correct, we only need to show that G' has the same least-cost path from s to t as G. It is clearly not possible for the reduced graph to have a lower-cost path from s to t, so it is sufficient to show that a least-cost path, P, from s to t in G also exists in G'. Induction shows that P exists in G' if for every v on P, test(v) is true. Assuming there is a v on P for which test(v) is not true leads to a contradiction, as we now show.
Let v be the first vertex on P for which test(v) is false. (Note that test(s) is necessarily true.) That v is on P implies that:
(1)  r(v,G) >= r(v,P) = min{m(s,v,P), m(v,t,P)}
However, if test(v) is false, then both of the following are true:
(2)  r(v,G) < m(s,v,P)
(3)  r(v,G) < d(v,t) <= m(v,t,P)
Taken together, (3) and (2) contradict (1). We conclude that for any v on P, test(v) is true, that G' includes P, and that the modified algorithm is correct.
Note that test(v) could use, in place of r(v,G),
some upper bound, b, on r(v,G). In that case, a correctness proof is the same except that (2) and (3) must be justified for the modified test(v). (2) and (3) would follow from these observations:
• b < m(s,v,P) because test(v) failed using b
• b < d(v,t) <= m(v,t,P) because test(v) failed using b
• r(v,G) <= b
We call b a reach bound for v.
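A sketch of the insertion-time test, using a precomputed reach bound per vertex and Euclidean coordinates, might look like this (a minimal illustration with assumed data structures, not the author's implementation):

#include <cmath>
#include <vector>

struct Coord { double x, y; };

// Euclidean distance, consistent with the reach metric (m(u,v) >= d(u,v)).
double dist(const Coord& a, const Coord& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Called just before v would be inserted into the priority queue.
//   reach_bound[v] : upper bound on r(v, G)
//   m_from_origin  : reach-metric length of the computed path from the origin to v
//   t              : destination vertex
// Returns false only if v cannot lie on a least-cost path this far from the origin
// and this close (in straight-line distance) to the destination.
bool reach_test(int v, double m_from_origin, int t,
                const std::vector<double>& reach_bound,
                const std::vector<Coord>& coord) {
    return reach_bound[v] >= m_from_origin || reach_bound[v] >= dist(coord[v], coord[t]);
}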
If infinity is employed as the reach bound for all v, then the algorithm becomes the Dijkstra algorithm. We note, without giving details, that this modification can be easily applied to the A* algorithm. For graphs projected into some space, the A* algorithm's estimation function for vertex v might be based on d(v,t), depending on the cost metric (weight function) for the graph. That A* and test(v) can share the cost of computing d(v,t), which is not completely trivial, is an added benefit of combining the two techniques. Finally, consider a requirement to compute a least-cost path from an origin s to any vertex in a set T of vertices. The modified algorithm, with or without A*, readily performs this as long as there is a function d(v,T) that computes the distance from a vertex v to the nearest vertex in T. For example, we have implemented this for T where T are the points on the network that intersect the boundary of a circle or rectangle (we temporarily add vertices on the edges where the intersections occur). Then the algorithm computes the shortest path to the circle or rectangle boundary. The performance of our algorithm obviously depends on how many vertices are rejected by the test function. That, in turn, depends on the distribution of the values of reach for the vertices in the graph. For road networks and other networks with a high degree of hierarchy, most vertices have low reach values (short reach) and only a few have high reach values (long reach). In fact, the distribution approximates an exponentially decreasing function of reach. There is a more general algorithm that uses a bidirectional search and requires no Euclidean distance function, but that approach lacks several advantages compared to the one we present here and is more complex.

4 A Fast Algorithm for Computing Reach Bounds

Computing r(v,G) for every v in G is expensive. The only method we know is to perform an all-pairs shortest path computation on G. But as noted at the end of section 3, an upper bound on r(v,G) can be used in place of r(v,G). We describe an algorithm to compute reach bounds that are close enough to the actual reach values that our reach-based least-cost path algorithm, using those reach bounds, yields the performance results stated in section 7. Our strategy computes small reach bounds first, then uses that information to compute larger reach bounds. This bootstrapping from low to high reach bounds is repeated until reach bounds for most vertices have been computed. The remaining vertices are assigned infinite reach bounds, which means that they are never rejected by the test function described in section 3. For a network with a high degree of hierarchy, the first iteration computes reach bounds for most vertices. The computation is performed by computing what we call "partial least-cost path trees". A partial least-cost path tree is a tree which is a directed subgraph of a least-cost path tree and has the same root as the least-cost path tree. The partial least-cost path trees that are needed to compute small reach bounds are small trees, so their computation time is much shorter than that of the complete least-cost path trees. The theory in the next section shows how to determine the extent of the required partial least-cost path trees. When reach bounds have been computed for some vertices, the next iteration is performed on a graph, G', with those vertices omitted. Because G' is smaller than G, partial least-cost path trees can be extended further on G', at a reasonable computing cost, than on G. This allows reach bounds to be computed for additional vertices. The bounds are computed using least-cost paths in G' and the previously computed reach bounds of vertices in G - G' adjacent to those least-cost paths. The reach bounds of the adjacent vertices allow the algorithm to infer how far least-cost paths in G can extend from where they leave G'. Iterations of this process with smaller and smaller G' continue until G' is small enough to assign its vertices infinite reach bounds without badly affecting the performance of path computation. Implementation details appear in section 6.

5 Theory for Computing Reach Bounds

The theorems in this section answer three key questions about computation of reach bounds:
• How far should the least-cost path computations extend in order to compute reach bounds? (Theorems 5.3 to 5.6)
• Given that extent of computation, for which vertices can a reach bound be computed? (Theorems 5.5 and 5.6)
• For those vertices, what reach bounds can be determined? (Theorems 5.1 and 5.2)
Theorems 5.3 to 5.6 address the first question by characterizing the set of least-cost paths that must be computed. The lengths of these paths, and hence the lengths of the computations, are bounded by these theorems.

Definition: Given a directed graph G with positive weights and a path P in G, we say that "P is minimal with respect to the reach of v in G" if P is a least-cost path including v, r(v,P) = r(v,G), and, for any proper subpath Q' of P containing v, r(v,Q') < r(v,G). Intuitively this means that removing an edge from either end of the path reduces the reach of v on the path. Obviously, for every vertex v in graph G, there exists a path P which is minimal with respect to the reach of v in G.

To illustrate the application of the theorems, suppose we wish to compute reach bounds for all vertices v in G for which r(v,G) < b. A special case of Theorem 5.6 tells us that it is sufficient to compute all least-cost paths, P, in G satisfying
m(P) <= 2b + max{m(l), m(f), m(f) + m(l)},
where f and l are, respectively, the first and last edges of P. Longer paths are not needed. For a given v in G, the maximum value of r(v,P) over all such P is computed. If that maximum is less than b, then this special case of Theorem 5.6 further tells us that among those paths is one which is minimal with respect to the reach of v in G. Therefore that maximum value is r(v,G). This special case (G = G') of Theorem 5.6 is all that is needed for the first iteration of the algorithm. For subsequent iterations on successively smaller subgraphs of G, the more general theorem is needed to determine the set of least-cost paths to be computed and which vertices' reach bounds are determined by those paths. Theorem 5.2 is needed to compute the reach bounds. Theorem 5.1 provides a bound on the reach of a node on a path in terms of the reach, in the graph, of the path's endpoints.

Theorem 5.1 (Simple Reach Bounding Theorem): Given,
• a directed graph G = (V,E) with positive weights
• a non-negative reach metric m : E -> R
• a vertex v in G
• a least-cost path P in G from s to t including v
• a subpath P' of P, from s' to t' and including v
then r(v,P) <= min{r(s',G) + m(s',v,P'), r(t',G) + m(v,t',P')}.

For Theorems 5.2, 5.3, and 5.4, the following are given:
• a directed graph G = (V,E) with positive weights
• a non-negative reach metric m : E -> R
• a subgraph G' = (V',E') of G, such that E' = {(u,v) | u,v in V', (u,v) in E}
• a vertex v in G' such that r(v,G) > 0
• a path P which is minimal with respect to the reach of v in G
In addition, for those theorems, we let
• path P' = the longest subpath of P containing v and included in G'
• s and s' be the first vertices of P and P', respectively
• t and t' be the last vertices of P and P', respectively
• f be the first edge of P after s' (the first edge of P' if P' has more than one vertex)
• l be the last edge of P before t' (the last edge of P' if P' has more than one vertex)
(note: f and l exist because r(v,P) = r(v,G) > 0, so P must have at least one edge prior to v and one edge following v)
104
• t and tr be the last vertices of P and P', respectively • / be the first edge of P after s' (the first edge of P' if P' has more than one vertex) • I be the last edge of P before t' (the last edge of P' if P' has more than one vertex) (note: / and / exist because r(v, P) = r(v, G) > 0, so P must have at least one edge prior to v and one edge following v) Theorem 5.1 cannot be readily applied when the reach of s' and t' are unknown. Theorem 5.2, which follows from Theorem 5.1, can be applied when the reach of vertices in G — G' only are known. Theorem 5.2 will show that computing P' is a key to computing a reach bound for v. Theorem 5.2 (Practical Reach Bounding Theorem): Let • g = max{r(x,G) +m(x,s')\(x,s') € E - E'} if there is a (x, s'} € E — E', otherwise g = 0, • h = max{r(y,G) + m(t',y)\(t',y) € E - E'} if there is a (£',y) € E - £", otherwise h — 0,
then, r(v,G) = r(v,P) < min{g + m(s',v,P'),h +
m(v,t,p')}
With Theorem 5.3, we begin to characterize the least-cost paths that must be computed to compute reach bounds by placing bounds on the lengths of the paths consistent with ensuring that P' is among those paths.
• a path P' in G' which includes v and is minimal for r(v, G'} > b (which we define to mean that, for any proper subpath of P', Q, including v, r(v,Q) 0), then m(P') < 26 -I- m(f) + m(l).
Theorem 5.6 puts Theorems 5.4 and 5.5 together Theorem 5.3 (Simple Minimal Path Bounding to describe a set of least-cost paths from which the reach bounds for some vertices in G' can be Theorem): m(P') < 2r(v,G') + max{m(/) + m(s,s',p),m(f) + computed.
m(t',t,p)}
Theorem 5.6 (Pruning Theorem for Reach The right-hand side of this inequality contains Bound Computation): Given, • G, G', m, v, and P as defined for Theorems 5.2, terms that are unknown during the computation: 5.3, and 5.4, m(s,s', P), m(£',£, P), and r(v, G'). Theorems that follow will replace the unknown terms with larger • some 6 > 0, but known values. • the set 5 of all least-cost paths in G' such that each path P' e S satisfies Theorem 5.4 (Practical Minimal Path Bounding Theorem): Let • c = max{r(a:,G)|ar € V - V'} if V ? V, otherwise 0, • d be defined such that if there exists a vertex u immediately preceding s' on P, then d = m(w, s'), and if not, then d = 0, • ebe defined such that if there exists a vertex u immediately following t' on P, then e = m(t', u), and if not, then e — 0,
then, m(P') <
2r(v,G')+c+max{m(l)+d,m(f)+e}.
The value of c can be computed by iterating over V — V, but there remains one term whose value is unknown during computation of the least-cost paths: r(v, G'). This problem can be addressed by limiting v to those vertices of G' for which r(v,G') < b for some suitably chosen value b. Then 6 can substituted into the inequality of Theorem 5.4 in place of r(v,G'), and all of the values on the right-hand side of the inequality become readily available during the computation for such v. However, we must be able to distinguish vertices v that satisfy r(u, G') < b from those that don't. Theorem 5 allows the computation to make this distinction.
P' has at least one edge, and m(P') < 2&+c+max{m(/)-l-d,m(/) +
e,m(f) + m(l)} where s' and t1 are, respectively, the first and last vertices of P', / and / are, respectively, the first and last edges of P', c = max{r(x,G)\x € V - V'} if V ^ V, otherwise 0, d = max.{m(u,s')\(u,s') € E - E'} if such a u exists, otherwise 0, e = max{m(*',u)|(<',w) e E - E'} if such a u exists, otherwise 0, if max{r(v,P')|P' e 5} < &, then r(v,G') < b, and S includes the longest subpath of P containing v and included in G' provided the subpath has at least one edge.
In other words, for a v meeting the conditions of Theorem 5.6, the set S includes a path P' with which we can apply the reach bound formula of Theorem Theorem 5.5 (Reach Bound Validation Theo- 5.2. We don't know which P' it is, but the maximum value for all such P' in S is clearly a safe bound. rem): Given, • some b > 0, 6 Algorithm and Implementation for • a directed graph G' = (V',E') with positive Computing Reach Bounds weights, The iterative process that computes reach bounds, • a non-negative reach metric m : E' —> R', as previously explained, computes partial least-cost • a vertex v € G' such that r(v,G'} > 6, path trees on progressively smaller subgraphs of the
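To spell out the special case quoted in the illustration earlier in this section, here is our own restatement of how the terms of Theorem 5.6 collapse when G' = G:

```latex
% Special case G' = G of Theorem 5.6 (our own worked restatement).
% With G' = G we have V - V' = \emptyset and E - E' = \emptyset, hence c = d = e = 0, and
\[
  m(P') \;\le\; 2b + c + \max\{m(l)+d,\; m(f)+e,\; m(f)+m(l)\}
         \;=\; 2b + \max\{m(l),\; m(f),\; m(f)+m(l)\}
         \;=\; 2b + m(f) + m(l),
\]
% where the last step uses the non-negativity of the reach metric m,
% so that m(f)+m(l) dominates the other two terms in the maximum.
```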
105
6 Algorithm and Implementation for Computing Reach Bounds

The iterative process that computes reach bounds, as previously explained, computes partial least-cost path trees on progressively smaller subgraphs of the input graph G. As the subgraphs become smaller, the partial least-cost path trees extend greater distances from their roots in order to compute reach bounds for more vertices. Theorem 5.6 helps determine how far each partial least-cost path tree must extend, while Theorem 5.2 provides a formula for the reach bounds.

There are three differences between the theory described in the preceding section and our actual implementation. We describe those differences, meant to simplify the implementation and enhance its performance, before describing the algorithm we used.

First, Theorem 5.6 does not actually call for the computation of least-cost path trees. In general, the set of paths prescribed by Theorem 5.6 might not form a tree but a directed acyclic subgraph of G, because, for given source and target vertices, there can be multiple least-cost paths. Though computing least-cost path dags is feasible, it is much simpler to compute least-cost path trees using the standard Dijkstra algorithm. Each iteration computes one partial least-cost path tree for each vertex in G', applying Theorem 5.6 to determine their extent. If all least-cost paths in G' are unique, that is, no two least-cost paths have the same origin and destination, then those trees contain all of the paths in the set S described by Theorem 5.6. If two least-cost paths have the same origin and destination, then the trees do not contain all of the paths in S. As a result, there might be some v for which the reach bound computed is incorrect. Such a v lies on a least-cost path P which is not unique to its origin and destination. However, an alternate path with the same origin and destination in G' is included in the tree rooted at that origin. So although the reach bound for v is underestimated, it never prevents the reach-based Dijkstra algorithm from finding a least-cost path.

The other two differences improve performance by helping to limit the size of the trees computed. A partial least-cost path tree can be computed by running a Dijkstra algorithm that terminates when the tree is sufficiently large to contain the desired paths. However, some branches of the tree might be extended much further than necessary, because the cost metric, or weights, of the graph determines how the tree grows while the reach metric determines whether a particular branch of the tree has been extended far enough. An optimization stops exploration on branches that satisfy Theorem 5.6 before other branches do. A side effect is that the tree produced contains some paths which are not least-cost paths. We found that this optimization only slightly increases the reach bounds computed, and has no other ill effects, if implemented conservatively.

The third difference is the most important, calls for some restatement of theorems, and is reflected in the pseudocode given at the end of this section. The test, from Theorem 5.6, to determine whether a particular path P' is needed in the tree,

(4) m(P') ≤ 2b + max{r(x,G) | x ∈ V − V'} + max{m(l) + d, m(f) + e, m(f) + m(l)},

is difficult to apply efficiently, that is, in such a way that keeps each partial least-cost path tree small. To be applied efficiently, an inclusion test I(P), which is true if path P is to be included in the tree, should have this property: given P and Q, both paths starting at the root of the least-cost path tree, such that P is a subpath of Q and Q has one more vertex than P, then I(Q) ⇒ I(P). Conversely, not(I(P)) ⇒ not(I(Q)). This permits the computation to stop at P if I(P) fails and avoid evaluating I(Q). If I(P) fails, no least-cost path that extends through and beyond the endpoint of P need be included in the least-cost path tree. We call this a "monotonic inclusion test".

The term e in (4) prevents it from being monotonic. All of the terms in (4) other than m(l) and e are fixed for a given least-cost path tree, and of those two only e actually prevents monotonicity. One way to form a monotonic inclusion test is to replace e by some constant. Recall that e is the maximum length of edges in E − E' leaving the last vertex of P'. If we replace e by k, where k is the maximum length of any edge in E − E', we get a monotonic test,

(5) m(P') ≤ 2b + max{r(x,G) | x ∈ V − V'} + max{m(l) + d, m(f) + k, m(f) + m(l)}.

Theorem 6.1: (5) is a monotonic inclusion test.

The disadvantage of (5) is that k can be large. One remedy is to replace long edges in the graph with several smaller ones, reducing k to some maximum allowed edge length. This might work well in some road networks. However, we implemented a simpler approach, though one that requires restatement of some theorems. In this approach, the least-cost path trees are generated not from G' alone but from a graph H whose edges and vertices are, respectively, E_H = {(x,y) | x ∈ V', y ∈ V} and V_H = V' ∪ {v | there exists (x,v) ∈ E_H}.
So H includes the vertices of G' and the vertices of G adjacent to vertices of G'.
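To illustrate how a monotonic inclusion test is used while growing a partial least-cost path tree, the following sketch (ours, not the paper's implementation) orders the search by the cost metric w, tracks the reach-metric length of each tree path, and stops extending a branch once a simplified inclusion bound fails. The single constant `budget` stands in for the fixed right-hand side of a loosened test in the spirit of (5); all names and the Edge layout are our own assumptions.

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Edge { int to; double w; double m; };   // w = cost metric, m = reach metric

// Grow one partial least-cost path tree rooted at s over graph H, ordered by the
// cost metric, pruning with a monotonic test on the reach-metric length of the
// tree path.  Returns the cost-metric distances of the vertices reached.
std::vector<double> partial_tree(const std::vector<std::vector<Edge>>& H,
                                 int s, double budget) {
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> dist(H.size(), INF);   // cost-metric distance from s
    std::vector<double> mlen(H.size(), INF);   // reach-metric length of the tree path
    using Item = std::pair<double, int>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    dist[s] = 0.0; mlen[s] = 0.0;
    pq.push({0.0, s});
    while (!pq.empty()) {
        auto [d, u] = pq.top(); pq.pop();
        if (d > dist[u]) continue;             // stale queue entry
        // Monotonic inclusion test: once it fails for the path to u, no extension
        // of that path is needed, so u's out-edges are simply not relaxed.
        if (mlen[u] > budget) continue;
        for (const Edge& e : H[u]) {
            if (dist[u] + e.w < dist[e.to]) {
                dist[e.to] = dist[u] + e.w;
                mlen[e.to] = mlen[u] + e.m;
                pq.push({dist[e.to], e.to});
            }
        }
    }
    return dist;
}
```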
The following theorems are slight restatements of earlier theorems to support this refined approach.

Theorem 6.2 (Practical Reach Bounding Theorem): Given the conditions of Theorem 5.2, r(v,G) = r(v,P) ≤ min{g + m(s',v,P'), r(t',G) + m(v,t',P')}.

Theorem 6.4 (Monotonic Minimal Path Bounding Theorem): Given all of the conditions and definitions of Theorem 5.4 and H as defined above, with the modification that the path P' be the longest subpath of P containing v and included in H (instead of G'), then m(P') ≤ 2r(v,H) + max{r(x,G) | x ∈ V − V'} + max{m(l) + d, m(f)}.

From Theorem 6.4, we can derive Theorem 6.6 in the same way that Theorem 5.6 is derived from Theorem 5.4.

Theorem 6.6 (Pruning Theorem for Reach Bound Computation): Given
• G, G', m, v, P, and b as defined for Theorem 5.6,
• H as defined above,
• the set S of all least-cost paths in H such that each path P' in S satisfies: P' has at least one edge, and m(P') ≤ 2b + c + max{m(l) + d, m(f) + m(l)}, where s' and t' are, respectively, the first and last vertices of P', f and l are, respectively, the first and last edges of P', c = max{r(x,G) | x ∈ V − V'} if V' ≠ V, otherwise 0, and d = max{m(u,s') | (u,s') ∈ E} if such a u exists, otherwise 0;
then, if max{r(v,P') | P' ∈ S} < b, we have r(v,H) < b, and S includes a longest subpath of P containing v and included in H, provided the subpath has at least one edge.

In comparison with Theorem 5.6, Theorem 6.6 eliminates the e term in the inequality that each path P' in S satisfies.

In the pseudocode of Figure 1, ReachBoundComputation computes reach bounds on graph G by calling Iterate repeatedly, passing it graph G and subgraph G'. Iterate attempts to compute reach bounds for vertices of G' given reach bounds for vertices of G − G'. Initially G' = G and the reach bounds, held in the array bounds, are set to infinity. Vertices for which finite reach bounds are computed are removed from G' after each iteration. After the last iteration, G' is expected to be small enough that assigning each of its vertices an infinite reach bound, which means those vertices are never pruned by the routing algorithm, does not significantly affect query performance. The b parameter of Iterate (corresponding to b in Theorem 6.6) controls the trade-off between the amount of computation performed by the iteration and the amount of reduction in the size of G'.

In each iteration, Theorem 6.6 is applied to compute the least-cost path trees needed to assign reach bounds to some vertices in V'. Each computed tree is traversed and Theorem 6.2 is applied to each vertex v in the tree to produce a reach bound, on the assumption that the tree contains P', the longest subpath in G' of a path which is minimal with respect to the reach of v in G. The maximum of these bounds over all of the trees is computed for each v. At the same time, the reach of each v over all of the trees is computed and used, with Theorem 6.6, to identify those v for which one of the trees contains the actual P'. For those v, the maximum bound computed over the trees is a valid reach bound for v in G.
7 Experimental Results

We present performance results for both the computation of reach bounds (for all but 3% to 5% of vertices) and the reach-based variant of Dijkstra's algorithm. All tests were performed by C++ programs, without compiler optimization, on a 2 GHz PC running Linux with sufficient memory to hold all of the graph representation, reach bounds, and data structures required by the algorithms. All of the tests were performed on graphs representing real road networks. For all tests, the cost metric was travel time and the reach metric was travel distance. Our priority queue was a bucket priority queue (see [2, 3, 8, 9]).

Table 1 shows a somewhat greater than linear increase in the computation time of reach bounds as the size of the dataset increases. This is what we expected. Generally, for hierarchical networks, we expect each iteration to require computation proportional to the size of the network, but the number of iterations needed to increase as a sublinear function of size. With the increase in data size, the computation of
ReachBoundComputation(G, B, bounds)
// G is a graph (V, E) with weight function w and reach metric m.
// B is an array of increasing positive integers.
// bounds is an array indexed by v ∈ V into which reach bounds are placed.
    G' := G
    For each v ∈ V: bounds[v] := ∞
    For each index i of B, in ascending order:
        Call Iterate(G, G', B[i], bounds)     // attempts to set finite values in bounds
        V' := {v | v ∈ V, bounds[v] = ∞}      // should become smaller in successive iterations
        E' := {(u,v) | (u,v) ∈ E, u ∈ V', v ∈ V'}
        G' := (V', E')

Iterate(G, G', b, bounds)
    If V' ≠ V then c := max{bounds[x] | x ∈ V − V'} else c := 0
    For each v ∈ V':
        bounds[v] := 0        // will be set back to ∞ if needed
        r[v] := 0             // reach of v in the least-cost path trees
    Form the graph H:
        E_H := {(x,y) | x ∈ V', y ∈ V}
        V_H := V' ∪ {v | there exists (x,v) ∈ E_H}
        H := (V_H, E_H)
    For each vertex s' ∈ V':
        If there exists (x, s') ∈ E − E'
            g := max{bounds[x] + m(x,s') | (x,s') ∈ E − E'}
            d := max{m(x,s') | (x,s') ∈ E − E'}
        else g := d := 0
        T := partial least-cost path tree of H rooted at s' containing all P' such that
             m(P') ≤ 2b + c + d + m(f) + m(l), where f and l are the first and last edges of P'
             // this inequality is simpler but slightly looser than Theorem 6.6 requires
        Traverse T (once) to do the following for each vertex v in T:
            Compute r(v,T)
            Over all paths P' in T that begin at s', include v, and end at a leaf t' of T:
                If t' ∈ V − V' then rt := bounds[t'] else rt := 0
                    // in this case, Theorem 6.6 guarantees that P terminates at t'
                rb := min{g + m(s',v,P'), rt + m(v,t',P')}     // application of Theorem 6.2
                If rb > bounds[v] then bounds[v] := rb
                If r(v,T) > r[v] then r[v] := r(v,T)
    For each v ∈ V':
        If r[v] ≥ b               // apply Theorem 6.6
            bounds[v] := ∞        // reach bound not validated

Figure 1: Pseudocode for Reach Bound Computation
region Alameda County San Francisco Bay Area
number of vertices 97240 393368
exact reach computation 233 minutes 4415 minutes
reach bound computation 28 minutes 161 minutes
Table 1: Exact Reach and Reach Bound Computations
algorithm Dijkstra A* Reach Reach + A* Exact Reach
avg route length 26 kilometers 26 kilometers 26 kilometers 26 kilometers 26 kilometers
cpu time (1000 routes) 62 seconds 60 seconds 14 seconds 12 seconds 9 seconds
priority queue insertions per path 44122 27395 5058 3711 3199
Table 2: Shortest Path Computation for Alameda
algorithm Dijkstra A* Reach Reach + A* Exact Reach
avg route length 56 kilometers 56 kilometers 56 kilometers 56 kilometers 56 kilometers
cpu time (1000 routes) 289 seconds 194 seconds 28 seconds 17 seconds 15 seconds
priority queue insertions per path 179263 79293 10043 5314 5797
Table 3: Shortest Path Computation for the Bay Area
algorithm Dijkstra A* Reach Reach + A*
avg route length 52 kilometers 52 kilometers 52 kilometers 52 kilometers
cpu time (1000 routes) 334 seconds 140 seconds 27 seconds 13 seconds
priority queue insertions per path 141464 50692 7910 3473
Table 4: Shortest Path to Box Computation for the Bay Area
exact reach values increased 19 times, roughly in proportion to the square of the size of the dataset, as we would expect from an all-pairs shortest path computation.

We compared the performance of shortest path computations for four Dijkstra variations. In each case, 1000 random routes were computed. Each route was chosen by randomly selecting two vertices from the graph. The random choices of the two vertices were independent, so the average distances between them were roughly proportionate to the geographic width and length of the road network. Comparing the tests on the smaller network (Table 2) with the tests on the larger network (Table 3) suggests a linear increase in computation as a function of path length using the reach algorithm, and a sublinear increase using the combination of reach and A*. The computation time of the Dijkstra algorithm on road networks is commonly considered, as a rule of thumb, to increase with the square of the distance. The improvement in performance provided by the reach algorithm is consistent with the heuristic approach using road classifications often used in industry.

The numbers of priority queue insertions per path computation give some insight into the factors determining performance. The A* algorithm, for instance, exhibits a reduction in priority queue insertions out of proportion to the reduction in computation time. That suggests that the cost of the estimate function, which involves an expensive square root, is itself significant. The improvements provided by reach and by A* appear to be independent of each other and complementary. The performance of the path computation using exact reach values (without A*) suggests that there is considerable room for trade-off between reach preprocessing time and query performance, and/or that the preprocessing can benefit from tuning. Tests with two road networks in Asia demonstrated similar reductions in computation time compared to Dijkstra, ranging from a 5 times reduction to a 10 times reduction. To validate the reach algorithms and their implementation, we compared the cost of each path computed among the various algorithm implementations. There were no differences.

In a parallel series of tests (Table 4), the destinations were regions bounded by latitude and longitude lines (which we call "boxes"). The origin was a vertex chosen at random. The center of the box and its size were randomly chosen. Combinations in which the origin vertex was inside the box were discarded. This simulates an application in which it is desirable
to know how soon a mobile device can enter a region using the road network. These tests demonstrate the versatility of the reach-based algorithms and suggest that reach-based algorithms would provide the fastest possible means of computing paths to multiple destinations or from multiple origins, given the difficulties other approaches face.

8 Further Work

It is desirable to characterize the performance of our algorithms as functions of easily computed network properties. Under some conditions, we think, the computation time of our shortest path algorithm is linear, or near linear, as a function of some metric on the path computed. Such a bound would be considered meaningful in road network applications. Considering the distribution of reach values would play a role in the analysis. We also think that, under some conditions, there is a dynamic, or incremental, approach that allows fast updates to the reach bounds when there are localized changes to the graph.

9 Acknowledgements

This research was performed at Wavemarket, Inc. in Emeryville, California. Thanks to Wavemarket, Inc. and Scott Hotes at Wavemarket for making this research and paper possible. Thanks to Dave Blackston at Wavemarket for his comments on this paper. Patents related to the ideas in this paper have been filed by Wavemarket, Inc.

References

[1] G. Ausiello, G. F. Italiano, A. M. Spaccamela, and U. Nanni. Incremental Algorithms for Minimal Length Paths. Journal of Algorithms, 12(4):615-638, 1991.
[2] B. V. Cherkassky, A. V. Goldberg, and C. Silverstein. Buckets, Heaps, Lists, and Monotone Priority Queues. SIAM Journal on Computing, 28(4):1326-1346, 1999.
[3] B. V. Cherkassky, A. V. Goldberg, and C. Silverstein. Buckets, Heaps, Lists, and Monotone Priority Queues. In Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 83-92, 1997.
[4] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. Second Edition, MIT Press, 2001.
[5] E. W. Dijkstra. A Note on Two Problems in Connexion with Graphs. Numer. Math., 1:269-271, 1959.
[6] H. N. Djidjev, G. E. Pantziou, and C. D. Zaroliagis. Improved Algorithms for Dynamic Shortest Paths. Algorithmica, 28(4):367-389, 2000.
[7] H. N. Djidjev. Efficient Algorithms for Shortest Path Queries in Planar Digraphs. In Proceedings of the 22nd Workshop on Graph Theoretic Concepts in Computer Science, Lecture Notes in Computer Science, pages 151-165. Springer Verlag, 1996.
[8] A. V. Goldberg and C. Silverstein. Implementations of Dijkstra's Algorithm Based on Multi-Level Buckets. Technical Report 95-187, NEC Research Institute, Princeton, NJ, 1995.
[9] R. Gutman. Priority Queues for Motorists. Dr. Dobb's Journal, 340:89-94, 2002.
[10] P. Hart, N. Nilsson, and B. Raphael. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100-107, 1968.
[11] P. Hart, N. Nilsson, and B. Raphael. Correction to "A Formal Basis for the Heuristic Determination of Minimum Cost Paths". SIGART Newsletter, no. 37:28-29, 1972.
[12] R. Jacob, M. Marathe, and K. Nagel. A Computational Study of Routing Algorithms for Realistic Transportation Networks. ACM Journal of Experimental Algorithmics, 4(6), 1999.
[13] N. Jing, Y. W. Huang, and E. Rundensteiner. Hierarchical Encoded Path Views for Path Query Processing: An Optimal Model and Its Performance Evaluation. IEEE Transactions on Knowledge and Data Engineering, 10(3):409-431, 1998.
[14] P. Klein and S. Subramanian. A Fully Dynamic Approximation Scheme for Shortest Path Problems in Planar Graphs. Algorithmica, 22(3):235-249, 1998.
[15] P. Klein, S. Rao, M. Rauch, and S. Subramanian. Faster Shortest-Path Algorithms for Planar Graphs. Journal of Computer and System Sciences (special issue on selected papers of STOC 1994), 55(1):3-23, 1997.
[16] R. E. Tarjan. Data Structures and Network Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1983.
Lazy Algorithms for Dynamic Closest Pair with Arbitrary Distance Measures

Jean Cardinal*        David Eppstein†
Abstract

We propose novel lazy algorithms for the dynamic closest pair problem with arbitrary distance measures. In this problem we have to maintain the closest pair of points under insertion and deletion operations, where the distance between two points must be symmetric and take values in a totally ordered set. Many geometric special cases of this problem are well studied, but only few algorithms are known when the distance measure is arbitrary. The proposed algorithms use a simple delayed computation mechanism to spare distance calculations, and one of these algorithms is a lazy version of the FastPair algorithm recently proposed by Eppstein. Experimental results on a wide number of applications show that Lazy FastPair performs significantly better than FastPair and the other algorithms we tested.

1 Introduction

We study the dynamic closest pair problem: given a distance measure between any two points, maintain the closest pair of a set of points under insertion and deletion operations. This is a classical topic, many geometrical special cases of which have been extensively studied before [3, 4].

In this paper, we study the problem with as few assumptions as possible. The only conditions imposed on the distance measure D(.,.) between two points are that it must be symmetric and take values in a totally ordered set. We avoid relying on any property of the distance such as Euclideanness or the triangle inequality.

This problem has a wide range of applications, a good survey of which can be found in Eppstein's paper [8]. Among others is agglomerative clustering, a well-known clustering technique that iteratively merges the two closest clusters [7]. The algorithm starts with as many clusters as points, and ends as soon as the number of clusters is sufficient, or when some threshold in the objective value is reached. This family of algorithms can make use of dynamic closest pair subroutines, a merging corresponding to two deletions and one insertion. Cluster similarity can be measured in several ways, and such routines should not rely on any assumptions about the similarity measure. It is worth noticing that in the vector quantization literature, this algorithm is known as the Pairwise Nearest Neighbor (PNN) method [10]. An example of a PNN-like algorithm that uses a more sophisticated distance measure can be found in [6]. The PNN method is a good example of a practical application that uses distance measures which have no useful geometric properties. It is shown in [8] that most known software implementing these algorithms uses a brute-force search method to find the optimal mergings. Other applications are greedy matching in graphs, traveling salesman heuristics and ray-intersection diagrams.

Efficient practical methods for this problem have been presented in [8], derived from the algorithms described earlier in [9] for various geometric problems. Implementations and links can be found on Eppstein's website¹. We propose improvements of these methods based on lazy deletion, in which actual distance recomputations are delayed until a closest pair query requires them.

Section 2 presents previously proposed algorithms for this problem. Section 3 presents new algorithms using a lazy deletion mechanism. In section 4 we comment on several experimental results. Conclusion and directions for future work are found in section 5. This manuscript is a revised version of a preliminary technical report [5].

2 Previous Algorithms

The following algorithms are all described in [8], although the first one is a simple method that can be found in other works. Conga lines and MultiConga are the only known linear-space algorithms providing nontrivial amortized bounds on the update complexities. Theoretical lower bounds on the complexity of the update operations are not known.
* Université Libre de Bruxelles, Brussels, Belgium, [email protected]. † University of California, Irvine, USA, [email protected].
¹ http://www.ics.uci.edu/~eppstein/projects/pairs/
2.1 Neighbor Heuristic. The neighbor heuristic associates with each point p of S its nearest neighbor in S \ {p}, together with the corresponding distance d(p) = min over q in S \ {p} of D(p,q). These data are maintained for each operation. A query consists in scanning all the distances d(p) and selecting the smallest. Worst-case time complexities are O(n), O(n) and O(n²) for query, insertion and deletion, respectively. This algorithm is actually a simple maintenance of the nearest neighbor directed graph structure. It is illustrated in Fig. 1.

Figure 1: Illustration of the neighbor heuristic.

2.2 Conga Lines and MultiConga. A conga line for a subset Si of the data points is a directed path, found by choosing a starting point arbitrarily, and then selecting each successive point to be the nearest unchosen neighbor of the previously selected point. Points within Si may choose their neighbors from the entire data set, while points outside Si must choose their neighbors from within Si. The conga lines data structure consists of a partition of the points into O(log n) sets Si, and a graph Gi for each set, with Gi initially constructed to be a conga line for Si, although subsequent deletions may degrade its structure. The data structure also contains a priority queue of the edges in the union of the graphs Gi, from which a closest pair can be found as the shortest edge. Each insertion creates a new set Si, and each deletion causes the vertices connected by graph edges to the deleted point to be moved to a new set Si; after every such update some sets are merged and their graphs reconstructed in order to maintain the O(log n) bound on the number of sets.

THEOREM 1. (CONGA LINES [8]) Conga Lines correctly maintain the closest pair in amortized time O(n log n) per insertion and O(n log² n) per deletion.

Proof. We define a suitable potential function, and study the sum of the effective cost of the operation and the corresponding potential function variation.

MultiConga is similar to Conga Lines, except that subsets Si are never merged: the number of subsets is allowed to become arbitrarily large.

THEOREM 2. (MULTICONGA [8]) MultiConga correctly maintains the closest pair in amortized time O(n) per insertion and O(n^{3/2}) per deletion.

2.3 FastPair. FastPair can be seen as a further relaxation of MultiConga, but is better explained in terms of the neighbor heuristic. In this method, when a new point is inserted, its nearest neighbor is computed, but the nearest neighbors and distances d(y) associated with other points y in the set are not updated. Hence the points that were previously in the set may be associated with wrong neighbors. This is not important since the correct distance can still be found associated with the new point. A second difference with the neighbor heuristic is that the structure is initialized with a single conga line, instead of computing all possible distances. In terms of complexity bounds, FastPair does not improve on the neighbor heuristic. However, experimentally, it was shown to be the best overall choice [8]. If we refer to conga lines, FastPair is a MultiConga data structure in which vertices that are moved after a deletion do not form a new subset, but rather a collection of singletons. It is illustrated in Fig. 2.
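For concreteness, here is a minimal sketch, under our own naming and with a brute-force O(n) neighbor scan, of the bookkeeping behind the neighbor heuristic of Section 2.1; it is an illustration, not code from [8].

```cpp
#include <functional>
#include <limits>
#include <utility>
#include <vector>

// Eager neighbor heuristic for an arbitrary symmetric distance oracle D.
// Points are identified by indices 0..n-1.
struct NeighborHeuristic {
    std::function<double(int, int)> D;
    std::vector<char> alive;
    std::vector<int> nbr;        // nearest live neighbor of each live point
    std::vector<double> d;       // D(p, nbr[p])

    NeighborHeuristic(int n, std::function<double(int, int)> dist)
        : D(std::move(dist)), alive(n, 0), nbr(n, -1),
          d(n, std::numeric_limits<double>::infinity()) {}

    void recompute(int p) {      // O(n) scan for p's nearest live neighbor
        nbr[p] = -1; d[p] = std::numeric_limits<double>::infinity();
        for (int q = 0; q < (int)alive.size(); ++q)
            if (q != p && alive[q] && D(p, q) < d[p]) { d[p] = D(p, q); nbr[p] = q; }
    }

    void insert(int p) {         // O(n): also updates points whose new neighbor is p
        alive[p] = 1;
        recompute(p);
        for (int q = 0; q < (int)alive.size(); ++q)
            if (q != p && alive[q] && D(q, p) < d[q]) { d[q] = D(q, p); nbr[q] = p; }
    }

    void remove(int p) {         // worst case O(n^2): recompute every point that pointed to p
        alive[p] = 0;
        for (int q = 0; q < (int)alive.size(); ++q)
            if (alive[q] && nbr[q] == p) recompute(q);
    }

    std::pair<int, int> closest_pair() const {   // O(n) scan of the stored distances
        int best = -1;
        for (int q = 0; q < (int)alive.size(); ++q)
            if (alive[q] && nbr[q] >= 0 && (best < 0 || d[q] < d[best])) best = q;
        return {best, best < 0 ? -1 : nbr[best]};
    }
};
```

FastPair keeps the same per-point fields but drops the second loop of insert (new points do not update older entries), which is exactly the behaviour that the lazy variants of Section 3 build on.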
3 Lazy Algorithms
We present new algorithms that extend the previous ones by using a lazy deletion procedure. While the basic idea is quite simple, it is not immediately clear that it is applicable in all situations, in particular when the distance is not geometric, and in combination with the FastPair algorithm. We show that it is indeed the case and, in section 4, that it provides what can be considered as the fastest known practical algorithm for generic dynamic closest pair maintenance.
Figure 2: Illustration of the FastPair algorithm.

3.1 Lazy Neighbor Heuristic. Lazy deletion spares a certain amount of distance recalculations by delaying them as much as possible. It works as follows: when a point p is deleted, we do not recompute the nearest neighbors of the points that had p as nearest neighbor. Instead, we simply mark them. When a query is made, the smallest distance d(p) is selected using a linear search. If p is marked, we recompute its nearest neighbor and iterate until the smallest distance corresponds to a non-marked point, which must belong to the closest pair. When a point p is inserted we search for its nearest neighbor, and at the same time modify the nearest neighbors of the previously existing points that have p as new nearest neighbor. If some of these points were marked, we unmark them, since the distance is again valid.

Lazy deletion was also proposed by Kaukoranta et al. [11] to accelerate the PNN algorithm. They restrict the use of lazy deletion to distances satisfying a so-called monotony criterion, and show that the weighted distance used in the PNN algorithm satisfies it. It is interesting to note that it can actually be implemented for any type of symmetric, totally ordered distance. The main difference between the lazy neighbor heuristic and the algorithm in [11] is that in our method some points may be unmarked during an insertion.

THEOREM 3. (CORRECTNESS) The lazy neighbor heuristic correctly maintains the closest pair.

Proof. Any distance d(p) associated with a marked point p is less than or equal to its actual nearest neighbor distance. The correctness of the algorithm follows.

In terms of the number of distance calculations, the lazy neighbor heuristic cannot cost more than the standard neighbor heuristic. It is not difficult, however, to show that the amortized time bounds are the same as in the unmodified neighbor heuristic, i.e. equal to the worst-case bounds.
3.2 Lazy FastPair. Lazy FastPair combines the lazy insertion algorithm of FastPair with lazy deletion. Hence for an insertion we compute the nearest neighbor of the new point, but do not change previously computed neighbors, while for a deletion we mark certain points as having deleted neighbors. If, in a query, it happens that d(p) is the minimum distance and p is marked, we recompute the nearest neighbor of p and iterate. For this algorithm, it may happen that the actual nearest neighbor distance of a marked point is smaller than the stored distance. This does not harm the correctness of the algorithm, as shown in the following.

THEOREM 4. (CORRECTNESS) Lazy FastPair correctly maintains the closest pair.

Proof. All the operations maintain the following invariant: Let p be a point in S such that d(p) is minimal; then either there exists q ∈ S such that (p,q) is the closest pair, or p is marked. Suppose, on the contrary, that p is not marked, d(p) is minimal, but (a,b) is the closest pair, with a ≠ p, b ≠ p and D(a,b) < d(p). Without loss of generality, suppose also that b was inserted after a. Then b must be marked, otherwise we would have d(b) = D(a,b) < d(p). And when b was inserted, its associated distance must have been at most equal to D(a,b). Hence d(p) cannot be smaller than d(b). So (a,b) is not the closest pair and p must belong to the closest pair. From this invariant, it is easy to check that the query procedure returns the closest pair.
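To make the operations of Lazy FastPair concrete, here is a compact sketch (our own illustration, not the authors' implementation): insertion computes only the new point's neighbor, deletion merely marks affected points, and the query re-resolves marked minima until an unmarked minimum is found. The class layout and the brute-force recompute are our assumptions.

```cpp
#include <functional>
#include <limits>
#include <utility>
#include <vector>

struct LazyFastPair {
    using Dist = std::function<double(int, int)>;    // symmetric distance oracle
    Dist D;
    std::vector<char> alive, marked;
    std::vector<int> nbr;                             // candidate neighbor of each point
    std::vector<double> d;                            // stored distance to that neighbor

    LazyFastPair(int n, Dist dist) : D(std::move(dist)),
        alive(n, 0), marked(n, 0), nbr(n, -1),
        d(n, std::numeric_limits<double>::infinity()) {}

    void recompute(int p) {                           // brute-force nearest live neighbor of p
        d[p] = std::numeric_limits<double>::infinity(); nbr[p] = -1;
        for (int q = 0; q < (int)alive.size(); ++q)
            if (q != p && alive[q] && D(p, q) < d[p]) { d[p] = D(p, q); nbr[p] = q; }
        marked[p] = 0;
    }

    void insert(int p) {                              // lazy insertion: only the new point's
        alive[p] = 1;                                 // neighbor is computed
        recompute(p);
    }

    void remove(int p) {                              // lazy deletion: mark, do not recompute
        alive[p] = 0;
        for (int q = 0; q < (int)alive.size(); ++q)
            if (alive[q] && nbr[q] == p) marked[q] = 1;
    }

    std::pair<int, int> closest_pair() {              // re-resolve marked minima until stable
        for (;;) {
            int best = -1;
            for (int q = 0; q < (int)alive.size(); ++q)
                if (alive[q] && nbr[q] >= 0 && (best < 0 || d[q] < d[best])) best = q;
            if (best < 0 || !marked[best]) return {best, best < 0 ? -1 : nbr[best]};
            recompute(best);
        }
    }
};
```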
It is interesting to notice that the invariant given in the proof of correctness for the lazy neighbor heuristic is not a necessary condition. Also, in Lazy FastPair, a point cannot be unmarked during an insertion step. This is illustrated in Fig. 3. As for the lazy neighbor heuristic, the amortized time bounds do not change. It is easy to find a sequence of worst cases in which the algorithm does not perform better than the neighbor heuristic. This sequence occurs when at some point the nearest neighbor graph is star-shaped and the point at the center of the star is deleted. Then all remaining points are marked. Subsequent queries might be forced to find the nearest neighbors of all these marked points, and eventually reconstruct a star-shaped graph.

Figure 3: Illustration of the insertion procedure: in Lazy FastPair, an insertion cannot unmark a point.

3.3 Lazy Conga Lines. The question naturally arises of whether it is possible to combine lazy deletion with the conga lines data structure, and whether this can lead to better complexity bounds. An outline of the algorithm is as follows. We maintain O(log n) conga lines as before. When inserting a point, we create a new conga line for it. During deletion, the point to be deleted is removed, and points that were pointing to it are marked. When a query is made, we check the top of the heap that is used to store conga line edges. We iteratively extract edges from the heap until an edge corresponding to an unmarked point is found. A new subset Si is created for all the extracted edges, a new conga line is created, and the graphs are merged until the number of subsets is O(log n). At that point, an unmarked edge can be found at the top of the heap, which corresponds to the actual closest pair.

Correctness of the algorithm follows from the correctness of the conga lines data structure and that of lazy deletion. Essentially, the algorithm is similar to Lazy FastPair, except that when the query procedure finds a marked point, it is not reinserted immediately, but is inserted together with all the other marked points at the top of the heap. Another difference is the maintenance of O(log n) subsets, which is not ensured in FastPair. At most O(log n) points are marked at each deletion (one from each subset), and all these points might later be reinserted in amortized time O(n log n), as in the original conga lines data structure. Hence the amortized time of deletion remains O(n log² n).
4 Experiments

We implemented the lazy neighbor heuristic and Lazy FastPair and ran them on the applications from Eppstein's testbed [8]. This includes agglomerative clustering, as described in the introduction, as well as greedy matching in graphs as described by Reingold and Tarjan [12] and the multifragment heuristic from Bentley [1, 2]. We refer the reader to the reference paper for further details about the testbed. We did not implement the lazy conga lines structure, because we did not find any rationale why this could be faster than Lazy FastPair in general. This implementation could however be planned later for completeness, especially for a maximum weight matching application previously identified as the only application for which conga lines performed better than FastPair. We did not compare our new algorithms to the quadtree method from the same paper, which requires quadratic storage complexity. We assume that the number of points is high enough so that linear storage complexity is a necessary condition.

In Figs. 4-7 we report the execution times in seconds with respect to the number of points. For readability, and to ease the comparison with the results in [8], the plots are presented with logarithmic scales. The machine used was a Pentium IV, 1.7 GHz, running Linux. Note that the FastPair and neighbor heuristic methods never perform fewer distance calculations than their lazy counterparts, hence comparing the numbers of distance calculations would have been beneficial for the lazy methods.

The lazy neighbor heuristic performed better than FastPair on most agglomerative clustering tasks - except when using the L∞ distance - but significantly worse than it on the multifragment TSP heuristic application. Lazy FastPair performed better than all other algorithms, except on agglomerative clustering in a fractal set, for which the lazy neighbor heuristic is better when the number of points is greater than 1500. Lazy FastPair performed strikingly better than all algorithms on the clustering application with the L∞ distance.

In general, it can be seen that these algorithms run in near quadratic time, which in our experiments means that updates are made in near linear time. This can be verified on Fig. 8, showing three graphs in which the running time is plotted with respect to n². Quadratic behavior was already observed in [8] and explains why algorithms that have better worst-case amortized time
Figure 4: Algorithm performance for hierarchical clustering with Lp metrics.

Figure 5: Algorithm performance for hierarchical clustering with other metrics.
Figure 6: Algorithm performance for greedy minimum matching.

Figure 7: Algorithm performance for the multifragment TSP heuristic.
bounds, such as conga lines, are not winners in the comparison: the worst case does not seem to be relevant in any of the experiments. A quadratic complexity implies that the average number of nearest-neighbor searches per iteration is a constant. In order to verify this, we measured the overall number of nearest neighbor searches and divided it by the number of closest pair queries. The results are presented in Table 1 for the three applications with pseudorandom distances using Lazy FastPair. For the greedy matching and multifragment heuristic applications, this number is equal to the average number of iterations performed in the closest pair query before an unmarked point is found. It is greater than one in the clustering case, since an insertion, which always costs one search, is performed at each iteration to include the newly formed cluster in the set. We can see that the number can be considered as constant in the three cases for pseudorandom distances. The results are similar for the other distances: with the L∞ metric in 31 dimensions, for instance, the number of searches per closest pair query is about 1.17 for the clustering, 0.68 for the greedy matching and 0.93 for the multifragment TSP heuristic.

5 Conclusion

We proposed two simple lazy algorithms for the dynamic closest pair problem with arbitrary distance measures. Lazy FastPair performs better than all previously proposed algorithms, although it has the same worst-case and amortized bounds as the simple neighbor heuristic. These algorithms are applicable to many different problems and can lead to much more efficient applications in practice, especially when distance calculation costs are high. Experiments should be carried out, for instance, on string clustering tasks using the edit distance, whose calculation requires a dynamic programming subroutine. We recommend the use of the Lazy FastPair algorithm in all situations where linear space complexity is required.

Among other issues, we may wonder whether it is possible to make these algorithms even more lazy, or use laziness in another way, and whether it is possible to use laziness to obtain better amortized complexity bounds. The prediction of the average acceleration ratios on some known distributions might also be of interest. Finally, the most interesting theoretical issue is probably that of the lower bounds on the complexity of each operation.

Acknowledgements. The authors thank the anonymous reviewer for his suggestions and for pointing out reference [2].
Figure 8: Plots of the running time vs. n².
References
(a) clustering
number of points             500     1000    2000    4000    8000
average number of searches   1.429   1.424   1.455   1.457   1.451

(b) greedy minimum matching
number of points             500     1000    2000    4000    8000
average number of searches   0.537   0.644   0.629   0.625   0.637

(c) multifragment TSP heuristic
number of points             500     1000    2000    4000    8000
average number of searches   0.845   0.831   0.838   0.840   0.839
Table 1: average number of nearest-neighbor searches per closest pair query
[1] J. Bentley. Experiments on traveling salesman heuristics. In SODA: ACM-SIAM Symposium on Discrete Algorithms, pages 91-99, 1990.
[2] J. Bentley. Fast Algorithms for Geometric Traveling Salesman Problems. ORSA Journal on Computing, 4(4):387-411, 1992.
[3] S. N. Bespamyatnikh. An optimal algorithm for closest pair maintenance. In COMPGEOM: Annual ACM Symposium on Computational Geometry, 1995.
[4] P. B. Callahan and S. R. Kosaraju. Algorithms for dynamic closest pair and n-body potential fields. In SODA: ACM-SIAM Symposium on Discrete Algorithms, 1995.
[5] J. Cardinal and D. Eppstein. Lazy algorithms for dynamic closest pair with arbitrary distance measures. Technical Report 502, ULB, 2003.
[6] D. P. de Garrido, W. A. Pearlman, and W. A. Finamore. A clustering algorithm for entropy-constrained quantizer design with applications in coding image pyramids. IEEE Trans. on Circuits and Systems for Video Technology, 5:83-85, 1995.
[7] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2nd edition, 2000.
[8] D. Eppstein. Fast hierarchical clustering and other applications of dynamic closest pairs. J. Experimental Algorithmics, 5(1):1-23, 2000. Also in 9th ACM-SIAM Symp. on Discrete Algorithms, 1998, pp. 619-628.
[9] D. Eppstein, P. Agarwal, and J. Matousek. Dynamic algorithms for half-space reporting, proximity problems, and geometric minimum spanning trees. In IEEE Annual Symposium on Foundations of Computer Science (FOCS), 1992.
[10] W. H. Equitz. A new vector quantization clustering algorithm. IEEE Trans. Acoust., Speech, Signal Processing, 37(10):1568-1575, October 1989.
[11] T. Kaukoranta, P. Franti, and O. Nevalainen. Fast and space efficient PNN algorithm with delayed distance calculations. In 8th International Conference on Computer Graphics and Visualization (GraphiCon'98), 1998.
[12] E. M. Reingold and R. E. Tarjan. On a greedy heuristic for complete matching. SIAM J. on Computing, 10(4):676-681, 1981.
Approximating the Visible Region of a Point on a Terrain*

Boaz Ben-Moshe†        Paz Carmi‡        Matthew J. Katz§

* Research by Ben-Moshe and Katz is partially supported by grant no. 2000160 from the U.S.-Israel Binational Science Foundation, and by the MAGNET program of the Israel Ministry of Industry and Trade (LSRT consortium). Research by Carmi is partially supported by a Kreitman Foundation doctoral fellowship.
† Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel, benmoshe@cs.bgu.ac.il.
‡ Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel, carmip@cs.bgu.ac.il.
§ Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel, carmipfflcs.bgu.ac.il.

Abstract

Given a terrain T and a point p on or above it, we wish to compute the region Rp that is visible from p. We present a generic radar-like algorithm for computing an approximation of Rp. The algorithm extrapolates the visible region between two consecutive rays (emanating from p) whenever the rays are close enough; that is, whenever the difference between the sets of visible segments along the cross sections in the directions specified by the rays is below some threshold. Thus the density of the sampling by rays is sensitive to the shape of the visible region. We suggest a specific way to measure the resemblance (difference) and to extrapolate the visible region between two consecutive rays. We also present an alternative algorithm, which uses circles of increasing radii centered at p instead of rays emanating from p. Both algorithms compute a representation of the (approximated) visible region that is especially suitable for visibility from p queries. Finally, we report on the experiments that we performed with these algorithms and with their corresponding fixed versions, using a natural error measure. Our main conclusion is that the radar-like algorithm is significantly better than the others.

1 Introduction

Let T be a triangulation representing a terrain (i.e., there is a height (z-coordinate) associated with each triangle vertex). We are interested in the following well known problem. Given a point p on (or above) T, compute the region Rp of T that is visible from p. A point q on T is visible from p if and only if the line segment pq lies above T (in the weak sense). Thus Rp
consists of all points on T that are visible from p. The problem of computing the visible region of a point arises as a subproblem in numerous applications (see, e.g., [3, 7, 9, 11]), and, as such, has been studied extensively [2, 3, 4, 5, 7]. For example, the coverage area of an antenna for which line of sight is required may be approximated by clipping the region that is visible from the tip of the antenna with an appropriate disk centered at the antenna. Since the combinatorial complexity of Rp might be Ω(n²) [3, 6], where n is the number of triangles in T, it is desirable to also have fast approximation algorithms, i.e., algorithms that compute an approximation of Rp. Moreover, a good approximation of the visible region is often sufficient, especially when the triangulation itself is only a rough approximation of the underlying terrain. Note that in this paper we are assuming that the terrain representation (i.e., the triangulation T) is fixed and cannot be modified. Simplifying the triangulation can of course result in a significant decrease in the actual running time of any algorithm for computing the visible region. This approach was studied in a previous paper [1]. See, e.g., [8] for more information on terrain simplification.

We present a generic radar-like algorithm for computing an approximation of Rp. The algorithm computes the visible segments along two rays ρ1, ρ2 emanating from p, where the angle between the rays is not too big. (I.e., each of the rays specifies a direction, and the algorithm computes the (projections of the) visible portions of the cross section of T in this direction.) It then has to decide whether the two sets of visible segments (one per ray) are close enough so that it can extrapolate the visible region of p within the wedge defined by ρ1 and ρ2, or whether an intermediate ray is needed. In the latter case the algorithm will now consider the smaller wedge defined by ρ1 and the intermediate ray. Thus a nice property of the algorithm is that the density of the sample rays varies and depends on the shape of Rp. In order to use this generic algorithm one must provide (i) a measure of resemblance for two sets of visible segments, where each set consists of the visible segments along some ray from p, and (ii) an algorithm to extrapolate the visible region between two rays whose corresponding sets were found similar enough.

In Section 2 we describe in more detail the generic algorithm and provide the missing ingredients. In Section 3 we present several other algorithms for computing the visible region Rp. The first algorithm computes Rp exactly. Since we need such an algorithm for the experimental evaluation of our approximate algorithms, we decided to devise one that is based on the general structure of the radar-like algorithm. Our exact algorithm is rather simple and is based on known results; nevertheless it seems useful. Specifically, the algorithm repeatedly computes the portion of Rp within a slice, defined by a pair of rays passing through vertices of the terrain, that does not contain a vertex of T in its interior. This computation can be done efficiently as is shown in [5]. The second algorithm (called the expanding circular-horizon algorithm or ECH for short) is in some sense orthogonal to the radar-like algorithm; it uses circles of increasing radii centered at the view point p instead of rays emanating from p. It is influenced by the exact algorithm described by De Floriani and Magillo [5]. The algorithm approximates the visible region Rp by maintaining the (approximate) viewing angles corresponding to a set of sample points on the expanding circular front (see Section 3). This allows us to partition the current front into maximal visible and invisible arcs. We now examine the sets of visible arcs on the current and previous fronts. If they are close enough, then the portion of Rp within the annulus defined by the two circles is approximated. Otherwise, we compute the visible arcs on a circle of intermediate radius and repeat.

Both the radar-like algorithm and the expanding circular-horizon algorithm have corresponding fixed versions, which play an important role in the experiments that were performed, see below. In the fixed version of the radar-like algorithm, the angle between two consecutive rays is fixed and we approximate the portion of Rp in the sector defined by the rays in any case, even if they are not close enough. In the fixed version of the expanding circular-horizon algorithm, the increase in radius between two consecutive circles is fixed and again we approximate the portion of Rp in the annulus defined by the circles in any case.

In Section 4 we suggest a natural way to measure the error in an approximation R'p of Rp produced by one of our algorithms. The error associated with R'p is the area of the XOR of R'p and Rp, divided by the area of the disk of radius l, where l is the range of sight that is in use. Using this error measure (and the exact algorithm), we performed a collection of experiments (described in Section 4) with the radar-like and expanding circular-horizon algorithms and their corresponding fixed versions. Our main conclusions from these experiments are that (i) the sensitive versions are significantly better than their corresponding fixed versions (when the total number of slices / annuli composing the final approximation is the same in both versions), and (ii) the radar-like algorithm is significantly better than the expanding circular-horizon algorithm. In Section 4 we offer some explanations for these findings.

2 The Radar-Like Algorithm

In this section we first present our radar-like generic algorithm. Next we describe the measure of resemblance and the extrapolation algorithm that we devised, and that are needed in order to transform the generic algorithm into an actual algorithm.

The generic algorithm is presented in the frame below. The basic operation that is used is the cross-section operation, denoted cross-section(T,p,θ), which computes the visible segments along the ray emanating from p and forming an angle θ with the positive x-axis. More precisely, cross-section(T,p,θ) computes the (projections of the) visible portions of the cross section of T in the direction specified by this ray. Roughly speaking, the generic algorithm sweeps the terrain T counterclockwise with a ray ρ emanating from p, performing the cross-section operation whenever the pattern of visible segments on ρ is about to change significantly with respect to the pattern that was found by the previous call to cross-section. The algorithm then extrapolates, for each pair of consecutive patterns, the visible region of p within the wedge defined by the corresponding locations of ρ.
Given a triangulation T representing a terrain (i.e., with heights associated with the triangle vertices), and a view point p on or above T:

    θ ← 0
    α ← some constant angle, say, π/45
    S1 ← cross-section(T, p, θ)
    S2 ← cross-section(T, p, θ + α)
    while (θ < 360)
        if (S1 is close-enough to S2)
            extrapolate(S1, S2)
            θ ← S2.angle
            S1 ← S2
            S2 ← cross-section(T, p, min(θ + α, 360))
        else
            μ ← (S1.angle + S2.angle)/2
            S2 ← cross-section(T, p, μ)
Figure 1: Grey marks visible and black marks invisible. (a) The close-enough threshold function: δ times the relative length of the XOR of S1 and S2. (b) The extrapolate function.
In order to obtain an actual algorithm we must provide precise definitions of close-enough and extrapolate.

Close-enough: A threshold function that checks whether two patterns S1, S2 are similar, where each of the patterns corresponds to the set of visible segments on some ray from p. There are of course many ways to define close-enough. We chose the following definition. In practice, the rotating ray is actually a rotating segment of an appropriate length. Let l denote this length. We refer to l as the range of sight. Now rotate the ray containing S2 clockwise until it coincides with the ray containing S1. See Figure 1(a). Next compute the length of the XOR of S1 and S2, that is, the total length covered by only one of the sets S1, S2. This length is then divided by l. Denote by v the value that was computed, and let δ be the angle between S1 and S2. If δ · v < C, where C is some constant, then return TRUE else return FALSE. The role of δ in the above formula is to force close-enough to return TRUE when the angle between the rays is small, even if the patterns that are being compared still differ significantly.

Extrapolate: Given two patterns S1, S2 which are close-enough, we need to compute an approximation of the portion of the visible region of p that is contained in the corresponding wedge. We do this as follows. Consider Figure 1(b). For each 'event point' (i.e., start or end point of a visible segment) on one of the two horizontal rays, draw a vertical segment that connects it with the corresponding point on the other ray. For each rectangle that is obtained, color it as follows, where grey means visible and black means invisible. If the horizontal edges of a rectangle are either both visible
from p or both invisible from p, then, if both are visible, color it grey, else color it black. If, however, one of the horizontal edges is visible and the other is invisible, divide the rectangle into four pieces by drawing the two diagonals. The color of the upper and lower pieces is determined by the color of the upper and lower edges, respectively, and the color of the left and right pieces is determined by the color of the rectangles on the left and on the right, respectively. Assuming there are no two event points such that one of them is exactly above the other, the coloring procedure is well defined. That is, the odd numbered rectangles will be colored with a single color, and the even numbered rectangles will be divided.

Remark 1. The representation of the (approximated) visible region R'p that is computed by the radar-like algorithm is especially suitable for queries of the form: Given a query point q on T, determine whether q is visible from p, or, more precisely, determine whether (the projection of) q lies in R'p. We first verify that q is within the range of sight l, i.e., that q lies within the disk of radius l centered at p. Next we determine whether q lies in R'p, in logarithmic time, by two binary searches. The first search locates the sector of the disk in which q lies, and the second locates the 'rectangle' within the sector in which q lies. Finally, it remains to check in which of the at most four triangles corresponding to this rectangle q lies.

Remark 2. One can think of alternative definitions for close-enough and extrapolate. However, it seems reasonable to require the following two properties: (i) A small change in the set of visible segments along a ray should only cause small changes in the close-
enough measure and in the visible region computed by extrapolate (within the appropriate wedge), and (ii) If there are no "surprises" between two close enough rays, then the visible region computed by extrapolate within the wedge should be very similar to the real visible region. In addition, the definitions should remain simple and easy to implement.
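As an illustration of the close-enough definition (and not of the authors' implementation, which was written in Java), the following C++ sketch computes the measure δ · v for two visibility patterns stored as sorted lists of visible intervals along a ray; the type names, the helper covered, and the tuning constant C are ours.

    #include <algorithm>
    #include <vector>

    using Interval = std::pair<double, double>;  // [from, to], distances from p along the ray
    using Pattern  = std::vector<Interval>;      // disjoint, sorted visible intervals

    // Is distance x covered by the pattern?
    static bool covered(const Pattern& s, double x) {
        for (const Interval& iv : s)
            if (iv.first <= x && x < iv.second) return true;
        return false;
    }

    // Total length covered by exactly one of s1, s2 (their symmetric difference).
    static double xor_length(const Pattern& s1, const Pattern& s2) {
        std::vector<double> cuts;
        for (const Interval& iv : s1) { cuts.push_back(iv.first); cuts.push_back(iv.second); }
        for (const Interval& iv : s2) { cuts.push_back(iv.first); cuts.push_back(iv.second); }
        std::sort(cuts.begin(), cuts.end());
        double total = 0.0;
        for (std::size_t i = 0; i + 1 < cuts.size(); ++i) {
            double mid = 0.5 * (cuts[i] + cuts[i + 1]);
            if (covered(s1, mid) != covered(s2, mid))   // exactly one pattern covers this piece
                total += cuts[i + 1] - cuts[i];
        }
        return total;
    }

    // close-enough: delta is the angle between the two rays, l the range of sight,
    // C a constant threshold (its value is not specified here).
    bool close_enough(const Pattern& s1, const Pattern& s2,
                      double delta, double l, double C) {
        double v = xor_length(s1, s2) / l;   // relative length of the XOR
        return delta * v < C;                // small angles force TRUE
    }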
3 Other Algorithms

In this section we present several other algorithms for computing the visible region. The first algorithm computes the visible region exactly; its general structure is similar to that of the radar-like algorithm. The second algorithm (called the expanding circular-horizon algorithm, or ECH for short) is influenced by the exact algorithm of De Floriani and Magillo [5]. It computes an approximation of the visible region using circles of increasing radii (instead of rays) and similar definitions of close-enough and extrapolate. Towards the end of this section we mention the fixed versions of the radar-like and expanding circular-horizon algorithms. The algorithms presented in this section are part of our testing environment for the radar-like algorithm. However, we believe that the exact algorithm and the expanding circular-horizon algorithm are of independent interest.

3.1 The exact algorithm. Since we need an algorithm that computes the visible region Rp exactly for the experimental evaluation of our approximate algorithms, we decided to devise one that is based on the general structure of the radar-like algorithm, instead of using one of the known algorithms. Our exact algorithm is rather simple. It repeatedly computes the (exact) portion of Rp within a slice that is defined by a pair of rays passing through vertices of the terrain T, and that does not contain a vertex of T in its interior; see Figure 2. This can be done in time O(m log m), where m is the number of edges of the terrain that cross the slice, as is shown in [5]. Remark. The radar-like algorithm can be modified

height based at q can be seen from p; θq is the viewing angle corresponding to q. Our algorithm approximates the visible region Rp by maintaining the (approximate) viewing angles corresponding to the points on the expanding circular horizon. More precisely, the algorithm only considers the points on the circular horizon at directions α, 2α, 3α, ... with respect to p, where α is a parameter of the algorithm. Initially, the circular horizon is the point p itself, and the corresponding viewing angle is −π/2.

Figure 3: Left: q is visible; Right: q is invisible.

The (approximate) viewing angles for the current circular horizon are computed from those of the previous circular horizon (by resolve-viewing-angles) as follows (see Figure 3). Let q be a point on the current horizon at direction iα with respect to p, and let q' be the point on the previous horizon at the same direction. Let θq' be the viewing angle corresponding to q'; the point q is said to be visible from p if and only if its viewing angle θq equals the angle at which q itself is seen from p. After applying resolve-viewing-angles to the current horizon C, we partition C into maximal visible and invisible arcs as follows. An arc of C between two consecutive sample points (i.e., points at directions iα and (i + 1)α with respect to p) is called a primitive arc. We first consider each of the primitive arcs a. If both endpoints of a are visible (resp., invisible), then we assume all points in a are visible (resp., invisible). If, however, one of the endpoints qi is visible and the other qi+1 is invisible, we assume all points in the half of a adjacent to
123
Figure 2: Left: the exact algorithm draws a ray through each vertex of T; Right: a slice that is defined by two consecutive rays, the corresponding cross sections, and the exact portion of Rp within the slice.
Given a triangulation T representing a terrain (i.e., with heights associated with the triangle vertices), and a view point p on or above T:

    α ← some constant angle, say π/180
    d ← some constant distance, say 10 meters
    r1 ← rmin
    C1 ← determine the viewing angles corresponding to the 2π/α sample points on the circle of radius r1
    C2 ← resolve-viewing-angles(T, p, C1, r1 + d)
    while (r1 < rmax)
        if (C1 is close-enough to C2)
            extrapolate(C1, C2)
            r1 ← C2.radius;  C1 ← C2
            r ← min(r1 + d, rmax)
            C2 ← resolve-viewing-angles(T, p, C1, r)
        else
            r ← (C1.radius + C2.radius)/2
            C2 ← resolve-viewing-angles(T, p, C1, r)
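For concreteness, the pseudocode above can be transcribed into C++ along the following lines. Terrain, Point, Horizon and the helper functions are assumed to be supplied elsewhere; this is only an illustrative sketch, not the authors' implementation (which was written in Java).

    #include <algorithm>   // std::min

    struct Terrain;        // triangulated terrain (details omitted)
    struct Point;          // the view point p
    struct Horizon {       // viewing angles sampled on a circle of a given radius around p
        double radius;
        // one viewing angle per sample direction alpha, 2*alpha, ...
    };

    // Assumed helpers corresponding to the operations named in the pseudocode above.
    Horizon initial_horizon(const Terrain& T, const Point& p, double r, double alpha);
    Horizon resolve_viewing_angles(const Terrain& T, const Point& p,
                                   const Horizon& prev, double r);
    bool    close_enough(const Horizon& c1, const Horizon& c2);
    void    extrapolate(const Horizon& c1, const Horizon& c2);

    void expanding_circular_horizon(const Terrain& T, const Point& p,
                                    double alpha, double d,
                                    double r_min, double r_max) {
        double  r1 = r_min;
        Horizon c1 = initial_horizon(T, p, r1, alpha);
        Horizon c2 = resolve_viewing_angles(T, p, c1, r1 + d);
        while (r1 < r_max) {
            if (close_enough(c1, c2)) {
                extrapolate(c1, c2);                 // approximate Rp inside the annulus
                r1 = c2.radius;
                c1 = c2;
                double r = std::min(r1 + d, r_max);  // advance by at most d
                c2 = resolve_viewing_angles(T, p, c1, r);
            } else {
                double r = (c1.radius + c2.radius) / 2.0;  // halve the step and retry
                c2 = resolve_viewing_angles(T, p, c1, r);
            }
        }
    }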
3.3 The corresponding fixed versions. Both the radar-like algorithm and the expanding circular-horizon algorithm have corresponding fixed versions. In the fixed version of the radar-like algorithm the angle between two consecutive rays is fixed and we approximate the portion of Rp in the sector defined by the rays in any case, even if they are not close-enough; see Figure 4. In the fixed version of the expanding circular-horizon algorithm, the increase in radius between two consecutive
circles is fixed, and again we approximate the portion of Rp in the annulus defined by the circles in any case; see Figure 5.

4 Experimental Results

In this section we report on the experiments that we performed with the approximation algorithms described in Sections 2 and 3, namely, the radar-like algorithm, the expanding circular-horizon algorithm (ECH), and their corresponding fixed versions. We have also implemented the exact algorithm (Section 3.1), which is needed for the error computation.

4.1 The error measure. In our experiments we use the following natural error measure. Let R'p be an approximation of Rp obtained by some approximation algorithm, where Rp is the region visible from p. Then the error associated with R'p is the area of the XOR of R'p and Rp, divided by the area of the disk of radius l, where l is the range of sight that is in use. See Figure 6.

4.2 The experiments. Ten input terrains representing ten different and varied geographic regions were used. Each input terrain covers a rectangular area of approximately 15 × 10 km² and consists of approximately 5,000-10,000 triangle vertices. For each terrain we picked several view points (x, y coordinates) randomly. For each view point p we applied each of the four approximation algorithms (as well as the exact algorithm) 20 times: once for each combination of height (either 1, 10, 20, or 50 meters above the surface of T) and range of sight (either 500, 1000, 1500, 2500, or 3500 meters). For each (approximated) region that was obtained, we computed the associated error, according to
Figure 4: The visible region computed by the radar-like algorithm (left) and by its corresponding fixed version (right), each composed of 72 slices.
Figure 5: The visible region computed by the expanding circular-horizon algorithm (left) and by its corresponding fixed version (right).
Figure 6: Left: the exact region Rp; Middle: the approximate region R'p computed by the radar-like algorithm; Right: XOR(R'p, Rp).
the error measure above. All this was repeated three times, once per each of three levels of sampling (see below). The level of sampling is determined by the number of calls to extrapolate that are issued during the execution of an algorithm. We used three levels of sampling: 80, 140, and 220. Since the extrapolation between two consecutive rays is comparable to the extrapolation between two circular horizons, this seems a fair basis for comparison. (In order to achieve a specific level of sampling when running one of the non-fixed versions, we repeated the computation several times, with different values of the constant C, until the desired level of sampling was reached.)

Accuracy level: 80      l=500   l=1000  l=1500  l=2500  l=3500
Fixed ECH                4.53    3.97    3.87    3.88    3.75
ECH                      3.71    3.49    3.40    3.29    3.35
Fixed radar-like         2.09    1.36    1.19    1.08    0.97
Radar-like               1.37    0.91    0.78    0.65    0.62

Table 1: Results for sampling level 80.
Accuracy level: 140     l=500   l=1000  l=1500  l=2500  l=3500
Fixed ECH                2.71    2.35    2.12    2.16    2.12
ECH                      2.30    2.09    1.88    1.88    1.94
Fixed radar-like         1.21    0.97    0.91    0.81    0.72
Radar-like               0.89    0.71    0.62    0.59    0.53

Table 2: Results for sampling level 140.
Accuracy level: 220     l=500   l=1000  l=1500  l=2500  l=3500
Fixed ECH                1.36    1.30    1.18    1.19    1.20
ECH                      1.22    1.15    1.13    1.09    1.13
Fixed radar-like         0.79    0.63    0.59    0.51    0.40
Radar-like               0.53    0.41    0.29    0.37    0.28

Table 3: Results for sampling level 220.
4.3 The results. Our results are presented in the following two sets of tables. The first three tables show the error for each of the four algorithms as a function of the sampling level and the range of sight. Consider, for example, the first table. This table contains our results for sampling level 80. The first line in this table corresponds to the fixed version of the expanding circular-horizon algorithm (ECH). The first entry in this line (4.53) is the average error (in percent), over all view points (in all terrains) and all four possible heights, obtained when running fixed ECH with accuracy level 80 and range of sight 500. Tables 4 and 5 show the amount of work needed in order to reach a certain level of accuracy. In Table 4 the amount of work is measured by the number of calls to cross-section (alternatively, resolve-viewing-angles), and in Table 5 it is measured by the total running time. For example, using the fixed radar-like algorithm, the average number of calls to cross-section needed to obtain an error of 1 percent was 80, and, using the fixed ECH algorithm, the average running time needed to obtain an error of 0.5 percent was 1648 milliseconds. All experiments were performed on the following platform: Pentium 4, 2.4 GHz, 512 MB, Linux 8.1, Java 1.4.
Error:               1.00    0.75    0.50
Fixed ECH             263     616    1009
ECH                   231     522     893
Fixed radar-like       80     140     220
Radar-like             61     103     174

Table 4: Average number of calls to cross-section (alternatively, resolve-viewing-angles) by accuracy level.

Error:               1.00    0.75    0.50
Fixed ECH             597    1045    1648
ECH                   579    1012    1591
Fixed radar-like      112     192     301
Radar-like            101     168     274

Table 5: Average running time (in milliseconds) by accuracy level.
4.4 Conclusions. Based on the results above, the radar-like approach is significantly better than the expanding circular-horizon approach. For each of the sampling levels, the regions computed by the two radar-like algorithms were more accurate than those computed by the two ECH algorithms for any range of sight (see Tables 1-3). Moreover, for each level of accuracy, the ECH algorithms had to work much harder than the radar-like algorithms (according to both measures) in order to reach the desired level of accuracy (see Tables 4-5).

A possible explanation for the better performance of the radar-like algorithms is that in the ECH algorithms the computation of the visible arcs on the current circular horizon is only an approximation, while in the radar-like algorithms the visible segments on a ray are computed exactly. Referring to Figure 7, if the ECH algorithms miss a ridge like the one in the left picture (drawn as a narrow rectangle), then all subsequent circles will miss it and therefore might conclude that the corresponding arcs are visible while they are not. On the other hand, a ridge like the one in the right picture that is missed by the radar-like algorithms does not affect subsequent computations, leading to smaller errors.

Another clear conclusion is that the adaptive (i.e., non-fixed) versions are more accurate than their corresponding fixed versions. For each sampling level, the regions computed by the adaptive versions were significantly more accurate than those of the fixed versions (see Tables 1-3). This advantage of the adaptive versions is especially noticeable when the sampling level is low. As expected, the adaptive versions are somewhat slower than the corresponding fixed versions for a given sampling level, since the adaptive versions perform more cross-section (alternatively, resolve-viewing-angles) operations. Actually, we found that on average the radar-like algorithm issues about 9 percent more calls to cross-section than the fixed radar-like algorithm. However, when taking into consideration the improved accuracy of the adaptive version, we see (Table 5) that the adaptive version is on average about 10 percent faster than the corresponding fixed versions.
Figure 7: ECH vs. the radar-like algorithm.

Finally, we recently performed some experiments with the two radar-like algorithms using somewhat larger terrains, consisting of approximately 40,000 vertices and covering a rectangular area of approximately 20 × 20 km². In general, the errors that we got were somewhat smaller (using the same levels of sampling and ranges of sight), and the adaptive version remained more accurate than the fixed version. The smaller errors are probably due to the higher resolution, although, in general, the nature of the terrain can significantly affect the accuracy of the approximations.

Acknowledgment. The authors wish to thank Ofir Ganani and Maor Mishkin, who helped implement the radar-like and the expanding circular-horizon algorithms, and Joe Mitchell for helpful discussions.
References
[1] B. Ben-Moshe, M.J. Katz, J.S.B. Mitchell and Y. Nir. Visibility preserving terrain simplification. Proc. 18th ACM Sympos. Comput. Geom., 303-311, 2002.
[2] D. Cohen-Or and A. Shaked. Visibility and dead-zones in digital terrain maps. Computer Graphics Forum 14(3):171-179, 1995.
[3] R. Cole and M. Sharir. Visibility problems for polyhedral terrains. J. of Symbolic Computation 7:11-30, 1989.
[4] L. De Floriani and P. Magillo. Visibility algorithms on triangulated digital terrain models. Internat. J. of GIS 8(1):13-41, 1994.
[5] L. De Floriani and P. Magillo. Representing the visibility structure of a polyhedral terrain through a horizon map. Internat. J. of GIS 10:541-562, 1996.
[6] F. Devai. Quadratic bounds for hidden line elimination. Proc. 2nd ACM Sympos. Comput. Geom., 269-275, 1986.
[7] R. Franklin, C.K. Ray and S. Mehta. Geometric algorithms for siting of air defense missile batteries. Tech. Report, 1994.
[8] P.S. Heckbert and M. Garland. Fast polygonal approximation of terrains and height fields. Report CMU-CS-95-181, Carnegie Mellon University, 1995.
[9] M.F. Goodchild and J. Lee. Coverage problems and visibility regions on topographic surfaces. Annals of Operations Research 18:175-186, 1989.
[10] N. Greene, M. Kass and G. Miller. Hierarchical z-buffer visibility. Computer Graphics Proc., Annu. Conference Series, 273-278, 1993.
[11] A.J. Stewart. Fast horizon computation at all points of a terrain with visibility and shading applications. IEEE Trans. Visualization Computer Graphics 4(1):82-93, 1998.
A computational framework for handling motion

Leonidas Guibas*        Menelaos I. Karavelas†        Daniel Russel‡

*Stanford University, [email protected]    †University of Notre Dame, [email protected]    ‡Stanford University, [email protected]
Abstract

We present a framework for implementing geometric algorithms involving motion. It is written in C++ and modeled after and makes extensive use of CGAL (Computational Geometry Algorithms Library) [4]. The framework allows easy implementation of kinetic data structure style geometric algorithms—ones in which the combinatorial structure changes only at discrete times corresponding to roots of functions of the motions of the primitives. This paper discusses the architecture of the framework and how to use it. We also briefly present a polynomial package we wrote that supports exact and filtered comparisons of real roots of polynomials and is extensively used in the framework. We plan to include our framework in the next release of CGAL.
1 Introduction

Motion is ubiquitous in the world around us and is a feature of many problems of interest to computational geometers. While projects such as CGAL [15] have provided an excellent software foundation for implementing static geometric algorithms (where nothing moves), there is no similar foundation for algorithms involving motion. In this paper we present such a framework for algorithms that fit the constraints of kinetic data structures. Kinetic data structures were introduced by Basch et al. in '97 [1, 12]. They exploit the combinatorial nature of most geometric data structures—the combinatorial structure remains invariant under some motions of the underlying geometric primitives and, when the structure does need to change, it does so at discrete times and in a limited manner. Algorithms that fit within the kinetic data structures framework have been found for a number of geometric constructs of interest, including Delaunay and regular triangulations in two and three dimensions and various types of clustering.

Computational geometry is built on the idea of predicates—functions of the description of geometric primitives which return a discrete set of values. Many of the predicates reduce to determining the sign of an algebraic expression of the representation (i.e. coordinates of points) of the geometric primitives. For example, to test whether a point lies above or below a plane, we compute the dot product of the point with the normal
of the plane and subtract the plane's offset along the normal. If the result is positive, the point is above the plane, zero on the plane, negative below. The validity of many combinatorial structures built on top of geometric primitives can be proved by checking a finite number of predicates of the primitives. These predicates are called certificates. For example, a three-dimensional convex hull is proved to be correct by checking, for each face, that all points are below the outward facing plane supporting it.

The kinetic data structures framework is built on top of this view of computational geometry. Let the geometric primitives move by replacing each of their coordinates with a function of time. As time advances, the primitives now trace paths in space called trajectories. The values of the certificates which proved the correctness of the static structure now become functions of time, called the certificate functions. As long as these functions have the correct value, the original structure is still correct. However, if one of the certificate functions changes value, the original structure must be updated and some new set of certificate functions computed. We call such occurrences events. Maintaining a kinetic data structure is then a matter of determining which certificate function changes value next (typically this amounts to determining which certificate function has the first root after the current time) and then updating the structure and certificate functions.

The CGAL project [9, 4] provides a solid basis for performing exact and efficient geometric computations as well as a large library of algorithms and data structures. A key idea they use is that of a computational kernel, an object which defines primitives, methods to create instances of primitives, and functors¹ which act on the primitives. CGAL defines a geometric kernel [15], which provides constant complexity geometric objects and predicates and constructions acting on them. The algorithms use methods provided by the kernel to access and modify the geometric primitives, so the actual representation need never be revealed. As a result, the implementation of the kernel primitives and predicates can be replaced, allowing different types of computation (such as fixed precision or exact) and different types of primitive representation (such as Cartesian or homogeneous coordinates) to be used. The library uses C++ templates to implement the generic algorithms, which results in little or no run time inefficiency.

¹ Functors are C++ classes that provide one or more operator() methods.
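Returning to the point-plane test described earlier, the sketch below illustrates the idea of a static predicate and of its kinetic counterpart, a certificate function; the function names are ours, and the second routine assumes, for simplicity, a point moving with constant velocity.

    #include <array>

    // Static predicate: sign of dot(p, n) - offset.
    // Positive: p is above the plane; zero: on it; negative: below.
    int plane_side(const std::array<double,3>& p,
                   const std::array<double,3>& n, double offset) {
        double v = p[0]*n[0] + p[1]*n[1] + p[2]*n[2] - offset;
        return (v > 0) - (v < 0);
    }

    // Kinetic counterpart for a linearly moving point p(t) = p0 + t*vel:
    // the same expression becomes a (here linear) certificate function of t,
    // and its root is the event time at which the point crosses the plane.
    double plane_crossing_time(const std::array<double,3>& p0,
                               const std::array<double,3>& vel,
                               const std::array<double,3>& n, double offset) {
        double a = vel[0]*n[0] + vel[1]*n[1] + vel[2]*n[2];       // coefficient of t
        double b = p0[0]*n[0] + p0[1]*n[1] + p0[2]*n[2] - offset; // value at t = 0
        return -b / a;  // caller must check a != 0 (motion parallel to the plane)
    }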
CGAL provides support for exact computation (as opposed to fixed precision computation with double-type numbers, which can result in numerical errors), as do some other libraries, such as CORE [24] and LEDA [3]. Exact computation can greatly simplify algorithms as they no longer need to worry about numerical errors and can properly handle degeneracy. However, it can be painfully slow at run time. Floating point filters [10, 21] were developed to address this slowness when computing predicates. The key observation is that fixed precision computation often produces the correct answer. For example, in our point-plane orientation test from above, if the dot product of the point coordinates and the plane normal is very much larger than the plane offset, then the point is above the plane, even if the exact value of the dot product is in some doubt. Floating point filters formalize this observation by computing some additive upper bound on the error. Then, if the magnitude of the value computed for the predicate is larger than that upper bound, the sign of the computed value is guaranteed to be correct, and the predicate value is known. However, if the computed value is not large enough, the calculation must be repeated using an exact number type. A variety of techniques have been developed to compute the error bounds. One of the easiest to use and tightest general purpose techniques is interval arithmetic [18, 2].

In this paper we present a framework for implementing kinetic data structures. We also provide a standalone library for manipulating and comparing real roots of polynomials using fixed precision and exact arithmetic, which can use floating point filters to accelerate the computations. We have used the framework for investigating kinetic data structure based techniques for updating Delaunay triangulations [14]. The framework depends on CGAL for a small number of low level classes and functions, mostly concerning evaluating and manipulating number types and performing filtered computations. In addition, it provides models of the CGAL geometric kernel to enable usage of static geometric data structures and algorithms on moving primitives.

Note on terminology: We adopt the terminology used by the C++ Standard Template Library and talk about concepts and models. A concept is a set of functionalities that any class which conforms to that concept is expected to implement. A class is a model of a concept if it implements that functionality. Concepts will be denoted using THISSTYLE.

2 Design Considerations

There were a number of important considerations which guided our design. They are:

• Runtime efficiency: There should be little or no penalty for using our framework compared to implementing your own more specialized and tightly integrated components. We use templates to make our code generic and modular. This allows most of the cost of the flexibility of the framework to be handled at compile time by the compiler rather than at runtime, and allows components to be easily replaced if further optimization is needed.

• Support for multiple kinetic data structures: Multiple kinetic data structures operating on the same set of geometric primitives must be supported. Unlike their static counterparts, the description of the trajectory of a kinetic primitive can change at any time, for example when a collision occurs. While such trajectory changes can not change the current state of a kinetic data structure, since the trajectories are required to be C0-continuous, they do affect the time when certificates fail. As a result there must be a central repository for the kinetic primitives which provides signals to the kinetic data structures when trajectories change. This repository, which is discussed in Section 3.6, can be easily omitted if there is no need for its extra functionality. A less obvious issue raised by having multiple kinetic data structures is that events from different kinetic data structures must be able to be stored together and have their times compared. The ramifications of this concern are discussed in Section 3.5.

• Support for existing static data structures: We provide functionality to aid in the use of existing static data structures, especially ones implemented using CGAL, by allowing static algorithms to act on snapshots of the running kinetic data structure. Our method of supporting this is discussed in Section 3.4.

• Support exact and filtered computation: Our polynomial solvers support exact root comparison and other operations necessary for exact kinetic data structures. In addition we support filtered computation throughout the framework. The effects of these requirements are discussed in Sections 3.2 and 3.3.

• Thoroughness: The common functionality shared by different kinetic data structures should as much as possible be handled by the framework. Different kinetic data structures we have implemented using the framework only share around 10 lines of code. An example of such a kinetic data structure is included in the appendix in Figure 2.
• Extensibility and modularity. The framework should be made of many lightweight components which the user can easily replace or extend. All components are tied together using templates so replacing any one model with another model of the same concept will not require any changes to the framework.

• Ease of optimization: We explicitly supported many common optimizations. The easy extensibility of the framework makes it easy to modify components if the existing structure is not flexible enough.

• Ease of debugging of kinetic data structures: We provide hooks to aid checking the validity of kinetic data structures as well as checking that the framework is used properly. These checks are discussed in Section 3.5. We also provide a graphical user interface which allows the user to step through the events being processed and to reverse time and to look at the history.
3 Architecture
3.1 Overview. The framework is divided into five main concepts, as shown in Figure 1. They are:

• FUNCTIONKERNEL: a computational kernel for representing and manipulating functions and their roots.

• KINETICKERNEL: a class which defines kinetic geometric primitives and predicates acting on them.

• MOVINGOBJECTTABLE: a container which stores kinetic geometric primitives and provides notifications when their trajectories change.

• INSTANTANEOUSKERNEL: a model of the CGAL kernel concept which allows static algorithms to act on a snapshot of the kinetic data structure.

• SIMULATOR: a class that maintains the concept of time and a priority queue of the events.

In a typical scenario using the framework, a SIMULATOR and MOVINGOBJECTTABLE are created and a number of geometric primitives (e.g. points) are added to the MOVINGOBJECTTABLE. Then a kinetic data structure, for example a two dimensional kinetic Delaunay triangulation, is initialized and passed pointers to the SIMULATOR and MOVINGOBJECTTABLE. The kinetic Delaunay triangulation extracts the trajectories of the points from the MOVINGOBJECTTABLE and the current time from the SIMULATOR. It then uses an instance of an INSTANTANEOUSKERNEL to enable a static algorithm to compute the Delaunay triangulation of the points at the current time. An instance of a KINETICKERNEL is used to compute the in_circle certificate function for each edge of the initial Delaunay triangulation. The kinetic data
structure requests that the SIMULATOR solve each certificate function and schedule an appropriate event. The SIMULATOR uses the FUNCTIONKERNEL to compute and compare the roots of the certificate functions. Initialization is now complete and the kinetic data structure can be run. Running consists of the SIMULATOR finding the next event and processing it, until there are no more events. Here, processing an event involves flipping an edge of the Delaunay triangulation and computing five new event times. The processing occurs via a callback from an object representing an event to the kinetic Delaunay data structure. If the trajectory of a moving point changes, for example it bounces off a wall, then the MOVINGOBJECTTABLE notifies the kinetic Delaunay data structure. The kinetic Delaunay data structure then updates all the certificates of edges adjacent to faces containing the updated point and reschedules those events with the SIMULATOR. A more detailed example will be discussed in Section 4. We will next discuss the principal concepts.

3.2 The polynomial package: the FUNCTIONKERNEL, solvers and roots. The FUNCTIONKERNEL is a computational kernel for manipulating and solving univariate equations. The polynomial package provides several models of the FUNCTIONKERNEL, all of which act on polynomials. The FUNCTIONKERNEL defines three primitives: the FUNCTION, the SOLVER and the CONSTRUCTEDFUNCTION. The two nested primitives shown in Figure 1 are NT, the number type used for storage, and ROOT, the representation type for roots. The kernel defines a number of operations acting on the primitives, such as translating zero, counting the number of roots in an interval, evaluating the sign of a function at a root of another, finding a rational number between two roots, and enumerating the roots of a function in an interval. Our models additionally provide a number of polynomial specific operations such as computing Sturm sequences and Bezier representations of polynomials. The FUNCTION concept (a polynomial in our models) supports all the expected ring operations, i.e., FUNCTIONS can be added, subtracted and multiplied. The CONSTRUCTEDFUNCTION wraps the information necessary to construct a polynomial from other polynomials. The distinction between FUNCTIONS and CONSTRUCTEDFUNCTIONS is necessary in order to support filtering, discussed below. If no filtering is used, a CONSTRUCTEDFUNCTION is an opaque wrapper around a FUNCTION. The ROOT type supports comparisons with other ROOTS and with constants, and a few other basic operations such as generation of an isolating interval for the root, negation, and computation of its multiplicity. We plan to extend the ROOT to support full field operations, but have not done so yet.
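To make the division of labor more tangible, the sketch below shows one possible shape of a FUNCTIONKERNEL model; it is our own illustration of the concept, not the framework's actual class or member names.

    #include <vector>

    template <class NT>                  // NT: the number type used for storage
    struct Function_kernel_sketch {      // hypothetical model of FUNCTIONKERNEL
        struct Function {                // the FUNCTION concept: a univariate polynomial
            std::vector<NT> coef;        // coef[i] multiplies t^i
        };
        struct Root;                     // the ROOT concept: comparable with Roots and constants
        struct Solver;                   // the SOLVER concept: enumerates roots in an interval

        // Examples of the kernel operations mentioned in the text:
        int sign_at(const Function& f, const Root& r) const;         // sign of f at a root of another function
        int number_of_roots(const Function& f, NT lo, NT hi) const;  // count roots of f in [lo, hi]
        NT  rational_between(const Root& a, const Root& b) const;    // a rational number between two roots
    };

A model intended for exact computation might, for instance, define Root as the solved polynomial together with an isolating interval, as discussed further below.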
Figure 1: Framework architecture: Each large white box represents a main concept, the sub boxes their contained concepts, and regular text their methods. A "Uses" arrow means that a model of a concept will generally use methods from (and therefore should take as a template parameter) the target of the arrow. A "Provides model of" arrow means that the source model provides an implementation of the destination concept through a typedef. Finally, a "Notifies" arrow means that the class notifies the other class of events using a standardized notification interface. See Section 3 for a description of each of the main concepts.

The most important attributes differentiating our various models of the FUNCTIONKERNEL are the type of solver used and how the resulting root is represented. We currently provide five different solver types:

• Bezier: a solver which uses a Bezier curve based representation of the polynomial to perform root isolation [17].

• CORE: a solver which wraps the CORE Expr type [24].

• Eigenvalue: a solver which computes roots using fixed precision computation of the eigenvalues of a matrix.

• Descartes: a set of solvers which use Descartes' rule of signs [20] to isolate roots in intervals.

• Sturm: a set of solvers which use Sturm sequences [25] to isolate roots. An earlier use of Sturm sequences in the context of kinetic data structures was published in [13].

We also provide a FUNCTIONKERNEL specialized to handle linear functions, which can avoid much of the overhead associated with manipulating polynomials. We also plan to provide specialized function kernels for small degree polynomials using the techniques presented in [7, 8, 16]. All of the solvers except for the Eigenvalue solver can perform exact computations when using an exact number type and produce roots which support exact operations.
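At the heart of the Descartes-rule solvers is a count of sign variations in a coefficient sequence. The small routine below sketches that count; it is an illustration of the rule, not code from the package.

    #include <vector>

    // Number of sign changes in the coefficient sequence (zero coefficients skipped).
    // By Descartes' rule of signs this bounds the number of positive real roots of the
    // polynomial and differs from it by an even number; subdivision solvers apply the
    // count to transformed polynomials in order to isolate roots in intervals.
    int sign_variations(const std::vector<double>& coef) {
        int count = 0;
        int last = 0;                     // sign of the previous nonzero coefficient
        for (double c : coef) {
            int s = (c > 0) - (c < 0);
            if (s != 0) {
                if (last != 0 && s != last) ++count;
                last = s;
            }
        }
        return count;
    }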
The simplest way to try to represent a root of a polynomial is explicitly, using some provided number type. This is used by numerical solvers (such as our Eigenvalue solver), which represent roots using a double, and by the CORE-based solver, which represents the root using CORE's Expr type. However, since roots are not always rational numbers, this technique is limited to either approximating the root (the former case) or depending on an expensive real number type (the latter case). An alternative is to represent roots using an isolating interval along with the polynomial being solved. Such intervals can be computed using Sturm sequences, Descartes' rule of signs or the theory of Bezier curves. When two isolating intervals are compared, we subdivide the intervals if they overlap in order to attempt to separate the two roots. This subdivision can be performed (for simple polynomials) by checking the sign of the original polynomial at the midpoint of the interval. However, subdivision will continue infinitely if two equal roots are compared. To avoid an infinite loop, when the intervals get too small, we fall back on a Sturm sequence based technique, which allows us to exactly compute the sign of one polynomial at the root of another. This allows us to handle all root comparisons exactly. We have variants of the Sturm sequence based solver and the Descartes rule of signs based solver that perform filtered computations.

Unfortunately, in a kinetic data structure, the functions being solved are certificate functions which are generated from the coordinate functions of the geometric primitives. If the coordinate functions are stored using a fixed precision type, then computing the certificate function naively will result in the solver being passed an inexact function, ending all hopes of exact comparisons. Alternatively, the certificate function generation could be done using an exact type, but this technique would be excessively expensive as fixed precision calculations are often sufficient. This means that in the kinetic data structures setting, the filtered root computation must have access to a way of generating the certificate function. To solve this problem we introduce the concept of a FUNCTIONGENERATOR. This is a functor which takes a desired number type as a parameter and generates a function. The computations necessary to generate the function are performed using the number type passed. The FUNCTIONGENERATOR gets wrapped by a CONSTRUCTEDFUNCTION and passed to the solver. The filtered solvers can first request that the certificate function generation be performed using an interval arithmetic type, so that error bounds are computed for each coefficient of the certificate polynomial. The solver then attempts to isolate a root. In general, the root isolation computation involves determining the signs of modified versions of the generated polynomial. If
some of the sign values cannot be determined (because the resulting interval includes zero), the solver requests generation of the certificate polynomial using an exact number type and repeats the calculations using the exact representation. In a kinetic data structures situation, we are only interested in roots which occur after the last event processed. In addition, there is often an end time beyond which the trajectories are known not to be valid, or of no interest for the simulation. These two times define an interval containing all the roots of interest. The Bezier, Descartes and Sturm based solvers all act on intervals and so can capitalize on this extra information. The SIMULATOR, described in Section 3.5, keeps track of these two time bounds and makes sure the correct values are passed to all instances of the solvers. All of the solvers except for the CORE-based solver correctly handle non-square free polynomials. All exact solvers handle roots which are arbitrarily close together and roots which are too large to be represented by doubles, although the presence of any of these issues slows down computations, since filtering is no longer effective and we have to resort to exact computation. A qualitative comparison of the performance of our solvers can be found in Table 1. We plan to describe the polynomial package in more detail in a later paper.

3.3 Kinetic primitives and predicates: the KINETICKERNEL. The KINETICKERNEL is the kinetic analog of the CGAL KERNEL. It defines constant complexity kinetic geometric primitives, kinetic predicates acting on them and constructions from them. The short example in Section 3.1 uses the KINETICKERNEL to compute the in_circle certificate functions. We currently provide two models, which define two and three dimensional moving weighted and unweighted points and the predicates necessary for Delaunay triangulations and regular triangulations. The FUNCTION concept discussed in Section 3.2 takes the place of the ring concept of the CGAL KERNEL and is the storage type for coordinates of kinetic primitives. As in CGAL, algorithms request predicate functors from the kernel and then apply these functors to kinetic primitives. There is no imperative programming interface provided at the moment. In principle, kinetic predicates return univariate functions, so they should return a FUNCTION. However, as was discussed in Section 3.2, in order to support floating point filters, the input to a solver is a model of CONSTRUCTEDFUNCTION. As a result, when a predicate functor is applied, no predicate calculations are done. Instead, a model of FUNCTIONGENERATOR is created that stores the arguments of the predicate and can perform the necessary predicate calculations when requested. We provide helper classes to aid users in adding their own predicates to a KINETICKERNEL.
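The following hypothetical generator, written for the "x_j(t) < x_k(t)" certificate of kinetic sorting, illustrates the role a FUNCTIONGENERATOR plays in filtering: the same code can be instantiated with an interval number type for the filtered pass and with an exact number type when a sign cannot be decided. The function name and the coefficient representation are ours, not the framework's.

    #include <algorithm>
    #include <vector>

    // Build the coefficients of the certificate function x_k(t) - x_j(t).
    // NT must be constructible from int and support += and -= (true for double
    // and for typical interval or exact number types).
    template <class NT>
    std::vector<NT> less_x_certificate(const std::vector<NT>& xj,   // coefficients of x_j(t)
                                       const std::vector<NT>& xk) { // coefficients of x_k(t)
        std::vector<NT> f(std::max(xj.size(), xk.size()), NT(0));
        for (std::size_t i = 0; i < f.size(); ++i) {
            if (i < xk.size()) f[i] += xk[i];
            if (i < xj.size()) f[i] -= xj[i];
        }
        return f;
    }

A filtered solver would first instantiate such a generator with an interval arithmetic type and fall back to an exact instantiation only if some required sign cannot be determined.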
Solver:              Low degree   Wilkinson   Mignotte   Small Intervals   Non-simple
Eigenvalue                  0.5           12        400                15          160
Filtered Descartes           18          230         9k                30           44
Descartes (double)            5           90          -                 7            -
Descartes                    32           2k       240k               150          750
Sturm                        81         2.5k        12k              2.9k          780
Filtered Sturm               28           99         9k                55          130
CORE                        291         126k       114k              2.5k            -
Bezier                      143          19k         2M                29          180
Table 1: The time taken to isolate a root of various classes of polynomials is shown for each of the solvers. "Low degree" are several degree six or lower polynomials with various bounding intervals and numbers of roots. "Wilkinson" is a degree 15 Wilkinson polynomial which has 15 evenly spaced roots. "Mignotte" is a polynomial of degree 50 with two roots that are very close together. "Small Intervals" is a degree nine polynomial solved over several comparatively small parts of the real number line. "Non-simple" are non-square free polynomials. All of the solvers except the Eigenvalue and the Descartes (double) produce exact results. When roots are only needed from a small fraction of the real line (the "Small Intervals" test case), interval-based solvers are actually quite competitive with the numeric solvers, although the comparison of the resulting roots will be more expensive. Note the large running time on the Mignotte polynomials, since the solvers have to fall back on a slower computation technique to separate the close roots. We believe the comparatively large running times of the Bezier based solver reflect the relative immaturity of our implementation, rather than a fundamental slowness of the method. The Eigenvalue solver is based on the GNU Scientific Library [11] and the ATLAS linear algebra package [23]. Times are in μs on a Pentium 4 running at 2.8 GHz. "k" stands for thousands and "M" for millions of μs.

It is important to note that the KINETICKERNEL does not have any notion of finding roots of polynomials, of performing operations at the roots of polynomials, or of static geometric concepts. The first and second are restricted to the FUNCTIONKERNEL (Section 3.2) and the SIMULATOR (Section 3.5). The third is handled by the INSTANTANEOUSKERNEL, which is discussed in the next section.

3.4 Connecting the kinetic and the static worlds: the INSTANTANEOUSKERNEL. There are many well implemented static geometric algorithms and predicates in CGAL. These can be used to initialize, test and modify kinetic data structures by acting on "snapshots" of the changing data structure. A model of the INSTANTANEOUSKERNEL concept is a model of the CGAL KERNEL which allows existing CGAL algorithms to be used on such snapshots. For example, as mentioned in Section 3.1, with the INSTANTANEOUSKERNEL we can use the CGAL Delaunay triangulation package to initialize a kinetic Delaunay data structure. We can also use the INSTANTANEOUSKERNEL and a static Delaunay triangulation data structure to manage insertion of new points into and deletion of points from a kinetic Delaunay triangulation. The INSTANTANEOUSKERNEL model redefines the geometric primitives expected by the CGAL algorithm to be their kinetic counterparts (or, in practice, handles to them). When the algorithm wants to compute a predicate on some geometric primitives, the INSTANTANEOUSKERNEL first computes the static representation of the kinetic primitives, and then uses these to compute the static predicate.
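A minimal sketch of this mechanism is shown below; the class names and the shared-state layout are ours and are only meant to illustrate how a predicate functor can carry the snapshot time, not to reproduce the framework's INSTANTANEOUSKERNEL.

    #include <memory>
    #include <vector>

    struct Moving_point_1 {                  // a point whose x-coordinate is a polynomial in t
        std::vector<double> coef;            // coef[i] multiplies t^i
        double at(double t) const {          // evaluate the trajectory at time t (Horner)
            double v = 0.0;
            for (std::size_t i = coef.size(); i-- > 0; ) v = v * t + coef[i];
            return v;
        }
    };

    struct Snapshot_state {                  // shared state: the current snapshot time
        double current_time = 0.0;
    };

    struct Less_x_at_time {                  // static-looking predicate over kinetic points
        std::shared_ptr<Snapshot_state> state;
        bool operator()(const Moving_point_1& a, const Moving_point_1& b) const {
            double t = state->current_time;  // all predicates share one time value
            return a.at(t) < b.at(t);        // compare the static snapshots
        }
    };

Updating the shared current_time changes the snapshot seen by every predicate already handed out, which is why the state is held through a shared pointer.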
We are able to use this technique due to a couple of important features of the CGAL architecture. First of all, the kernel is stored as an object in CGAL data structures, so it can have state (for the INSTANTANEOUSKERNEL the important state is the current time). Secondly, predicates are not global functions; instead, they are functors that the algorithm requests from the kernel. This means that they, too, can have internal state, namely a pointer to the INSTANTANEOUSKERNEL, and this state can be set correctly when they are created. Then, when an algorithm tries to compute a predicate, the predicate functor asks the INSTANTANEOUSKERNEL to convert its input (handles to kinetic geometric primitives) into static primitives and can then use a predicate from a static CGAL kernel to properly compute the predicate value. The pointer from the INSTANTANEOUSKERNEL predicate to the INSTANTANEOUSKERNEL object is unfortunate, but necessary. Some CGAL algorithms request all the predicate functors they need from the kernel at initialization and store those functors internally. Since a given CGAL object (i.e. a Delaunay triangulation) must be able to be used at several snapshots of time, there must be a way to easily update time for all the predicates, necessitating shared data. Fortunately, predicates are not copied around too much, so reference counting the shared data is not expensive. Note that the time value used by our INSTANTANEOUSKERNEL model must be represented by a number type, meaning that it cannot currently be a model of ROOT. This somewhat limits the use of the INSTANTANEOUSKERNEL. We use it primarily for
initialization and verification, neither of which need to occur at roots of functions. Some techniques for addressing verification will be discussed in the next section. Let us conclude the discussion of the INSTANTANEOUSKERNEL concept by noting that we do not require that models of the static kernels used by the INSTANTANEOUSKERNEL be CGAL KERNELS, but rather that they conform with the CGAL KERNEL concept. The user has the ability to provide his/her own kernel models and may or may not use CGAL.

3.5 Tracking time: the SIMULATOR. Running a kinetic data structure consists of repeatedly figuring out when the next event occurs and processing it. This is the job of the SIMULATOR. It handles all event scheduling, descheduling and processing and provides objects which can be used to determine when certificate functions become invalid. Since events occur at the roots of certificate functions, the ROOT type defined by a FUNCTIONKERNEL is used to represent time by the SIMULATOR. In the example in Section 3.1 the kinetic Delaunay data structure requests that the SIMULATOR determine when in_circle certificate functions become invalid and schedules events with the SIMULATOR. The SIMULATOR also makes sure the appropriate callbacks to the kinetic Delaunay data structure are made when certificates become invalid. Our model of the SIMULATOR is parameterized by a FUNCTIONKERNEL and a priority queue. The former allows the solver and root type to be changed, so numeric, exact or filtered exact computation models can be used. The priority queue is by default a queue which uses an interface with virtual functions to access the events, allowing different kinetic data structures to use a single queue. It can be replaced by a queue specialized for a particular kinetic data structure if desired. The ROOT concept is quite limited in which operations it supports—it effectively only supports comparisons. Roots cannot be used in computations or as the time value in an INSTANTANEOUSKERNEL. As a result, we take a somewhat more topological view of time. Two times, t0 and t1, are considered topologically equivalent if no roots occur in the interval [t0, t1]. The lack of separating roots means that each certificate function has the same sign over the interval. This idea can be extended to a set of kinetic data structures. When a simulation is running, if the time of the last event which occurred, tlast, and the time of the next event, tnext, are not equal, then the current combinatorial structures of all of the kinetic data structures are valid over the entire interval (tlast, tnext). In addition there is a rational value of time, tr, which is topologically equivalent to all times in the interval. Computations can be performed at tr since it can be easily represented. This flexibility is used
extensively in the SIMULATOR. When such a tr exists, the kinetic data structures are all guaranteed to be valid and non-degenerate and so can be easily verified. The SIMULATOR can notify the kinetic data structures when this occurs and they can then use an INSTANTANEOUSKERNEL to perform self-verification. We can also use this idea to check the correctness of individual certificates upon construction. We define a certificate to be invalid when the certificate function is negative. As a result it is an error, and a common sign of a bug in a kinetic data structure, to construct a certificate function whose value is negative at the time of construction. Unfortunately, the time of construction is generally a root and this check cannot be performed easily. However, we can find a time topologically equivalent to the current time for that function (or discover if no such time exists) and evaluate the function at that time. This is still a very expensive operation, but faster than the alternative of using a real number type. In addition, in order to properly handle two events occurring simultaneously, the SIMULATOR must check if the certificate function being solved is zero at the current time. If it is zero, and negative immediately afterwards, then the certificate fails immediately. This can be checked in a similar manner. Even roots of polynomials (where the polynomial touches zero but does not become negative) can generally be discarded without any work being done since they represent a momentary degeneracy. However, at an even root, the kinetic data structure is degenerate, and as a result is not easily verifiable. Since kinetic data structures are generally written only to handle odd roots, when verification is being performed as above, each even root must be divided into two odd roots before being returned. These cases are handled properly by our SIMULATOR and solvers.

3.6 Coordinating many kinetic data structures: the MOVINGOBJECTTABLE. A framework for kinetic data structures needs to have support for easily updating the trajectories of kinetic primitives, such as when a collision occurs. This requirement is in contrast to static geometric data structures, where the geometric primitives never change and their representations are often stored internally to the data structure. In the simple example presented in Section 3.1, the kinetic Delaunay triangulation queries the MOVINGOBJECTTABLE for all the moving points on initialization. Later, when the simulation is running, the MOVINGOBJECTTABLE notifies the kinetic Delaunay data structure whenever a point's trajectory changes. The MOVINGOBJECTTABLE allows multiple kinetic data structures to access a set of kinetic geometric primitives
of a particular type and alerts the kinetic data structures when a new primitive is added, one is removed, or a primitive's trajectory changes. Our model of the MOVINGOBJECTTABLE is actually a generic container that provides notification when an editing session ends (for efficiency, changes are batched together). There is no internal functionality specific to kinetic data structures or to a particular type of primitive. The user must specify what type of kinetic primitive a particular instance of the MOVINGOBJECTTABLE model will contain through a template argument (in the architecture diagram, Figure 1, a three dimensional moving point is used as an example). This type is exposed as the Object type shown in the figure. To access an object, a KEY is used which uniquely identifies the object within this container and which has a type specific to this container type. The MOVINGOBJECTTABLE uses a notification system, which will be briefly explained in Section 3.7, to notify interested kinetic data structures when a set of changes to the primitives is completed. The kinetic data structures must then request the keys of the new, changed and deleted objects from the MOVINGOBJECTTABLE and handle them accordingly. We provide helper classes to handle a number of common scenarios, such as a kinetic data structure which is incremental and can handle the changes of a single object at a time (as is done in the example, Figure 2 in the appendix), or a kinetic data structure which will rebuild all certificates any time any objects are changed (which can be more efficient when many objects change at once). The user can easily add other policies as needed. The MOVINGOBJECTTABLE model provided will not meet the needs of all users, as there are many more specialized scenarios where a more optimized handling of updates will be needed. The general structure of the MOVINGOBJECTTABLE model can be extended to handle many such cases. For example, if moving polygons are used, then some kinetic data structures will want to access each polygon as an object, whereas others will only need to access the individual points. This extra capability can be added without forcing any changes to existing (point based) kinetic data structures by adding methods to return modified polygons in addition to those which return changed points. When trajectory changes happen at rational time values, the MOVINGOBJECTTABLE can check that the trajectories are C0-continuous. Unfortunately, the situation is much more complicated for changes which occur at roots. Such trajectories cannot be exactly represented in our framework at this time. The MOVINGOBJECTTABLE only knows about one type of kinetic primitive and has no concept of time, other kinetic primitives or predicates. When a kinetic data structure handles an insertion, for example, it must query the SIMULATOR for an appropriate time value and
generate primitives using the KINETICKERNEL. A more detailed discussion of how to use the MOVINGOBJECTTABLE appears in Section 4.

3.7 Miscellaneous: graphical display, notification and reference management. We provide a number of different classes to facilitate graphical display and manipulation of kinetic data structures. There are two and three dimensional user interfaces based on the Qt [19] and Coin [6] libraries, respectively. We provide support for displaying two and three dimensional weighted and unweighted point sets and two and three dimensional CGAL triangulations. Other types can be easily added. A number of objects need to maintain pointers to other independent objects. For example, each kinetic data structure must have access to the SIMULATOR so that it can schedule and deschedule events. These pointers are all reference counted in order to guarantee that they are always valid. We provide a standard reference counting pointer and object base to facilitate this [5]. Runtime events must be passed from the MOVINGOBJECTTABLE and the SIMULATOR to the kinetic data structures. These are passed using a simple, standardized notification interface. To use notifications, an object registers a proxy object with the MOVINGOBJECTTABLE or SIMULATOR. This proxy has a method new_notification which is called when some state of the notifying object changes and is passed a label corresponding to the state that changed. For convenience in implementing simple kinetic data structures, we provide glue code which converts these notifications into function calls—i.e., the glue code converts the MOVINGOBJECTTABLE notification that a new object has been added into the function call new_object on the kinetic data structure. The glue code is used in the example in Figure 2. The base class for the notification objects manages the registration and unregistration to guard against invalid pointers and circular dependencies. This notification model is described in [5].

4 Implementing a kinetic data structure

4.1 Sort_kds overview. Figure 2, located in the appendix, depicts a complete kinetic data structure, Sort_kds, implemented using our framework. The data structure being maintained is very simple: a list of the geometric objects in the simulation sorted by their x coordinate. However, it touches upon the most important parts of the framework. For a simple kinetic data structure like this, much of the code is shared with other kinetic data structures. We provide a base class that implements much of this shared functionality. However, we do not use it here in order to better illustrate the various parts of our framework.
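Before going through the details, the heavily simplified C++ skeleton below sketches the two pieces of state that such a kinetic sorting structure keeps and how a certificate is rebuilt; the names and the Simulator interface assumed here are ours, not those of the appendix code.

    #include <list>
    #include <map>

    template <class Simulator, class ObjectKey, class EventKey>
    struct Sorted_list_kds_sketch {
        std::list<ObjectKey>          objects;       // combinatorial structure: keys sorted by x
        std::map<ObjectKey, EventKey> certificates;  // pending swap event per adjacent pair
        Simulator*                    simulator;     // assumed to offer schedule/deschedule

        // Rebuild the certificate for the pair starting at 'first': deschedule any
        // stale event, solve the certificate function, and schedule a swap event
        // at its next root after the current time (scheduling is elided here).
        void rebuild_certificate(ObjectKey first) {
            auto it = certificates.find(first);
            if (it != certificates.end()) simulator->deschedule(it->second);
            // EventKey e = simulator->schedule(...);   // next failure time + callback
            // certificates[first] = e;
        }
    };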
Like most kinetic data structures, the maintained data has two parts (in this case stored separately):

• the combinatorial structure being maintained, in this case the list objects, declared at the end of the class.

• the mapping between connections in the combinatorial structure and pending events. In this case the connections are pairs of adjacent objects in the sorted list. The mapping is stored in the map certificates, using the key of the first point in the pair. When a pair is destroyed (because the objects are no longer adjacent), the event key stored in the mapping is used to deschedule the corresponding event.

As is characteristic of many kinetic data structures, Sort_kds defines a class Event, which stores the information for a single event, and has six main methods. The methods are:

• new_object: a point has been added to the simulation and must be added to the data structure.

• change_object: a point has changed its trajectory and the two certificates involving it must be updated.

• delete_object: a point has been removed from the simulation. It must be removed from the data structure, the events involving it descheduled and a new event created.

• swap: an event has occurred and two objects are about to become out of order in the list and so must be exchanged.

• rebuild_certificate: for some reason, a predicate corresponding to a particular piece of the combinatorial structure is no longer valid or the action that was going to be taken in response to its failure is no longer correct. Update the predicate appropriately. This method is only called from within the kinetic data structure.

• validate_at: check that the combinatorial structure is valid at the given time.

The first three methods are called in response to notifications from the MOVINGOBJECTTABLE. The fourth method is called by Event objects. The last method is called in response to a notification from the SIMULATOR.

4.2 Sort_kds in detail. On initialization the Sort_kds registers for notifications with a MOVINGOBJECTTABLE and a SIMULATOR. It receives notifications through two proxy objects, mot_listener_ and sim_listener_, which implement the notification interface and call functions on the kinetic data structure when appropriate. We provide standard proxy objects, Moving_object_table_listener_helper
and Simulator_kds_listener_, which are used, but implementers of kinetic data structures are free to implement their own versions of these simple classes. The MOVINGOBJECTTABLE proxy calls the new_object, delete_object and change_object methods of the kinetic data structure when appropriate. The SIMULATOR proxy calls the validate_at method when there is a rational time value at which verification can be performed. See Section 3.5 for an explanation of when this occurs. The proxy objects store the (reference counted) pointers to the MOVINGOBJECTTABLE and SIMULATOR objects for later use. The SIMULATOR pointer is used by the kinetic data structure to request the current time and schedule and deschedule events. The MOVINGOBJECTTABLE pointer is used to access the actual coordinates of the kinetic objects. Once initialization is completed, the behavior of the kinetic data structure is entirely event driven. The first thing that will occur is the addition of a point to the MOVINGOBJECTTABLE, which results in the new_object method being called. This method is passed a Key which uniquely identifies a point in the MOVINGOBJECTTABLE. The Sort_kds makes use of the INSTANTANEOUSKERNEL to properly handle the insertion by using an INSTANTANEOUSKERNEL-provided functor which compares the x coordinates of two objects at the current instant of time. This functor is then passed to the STL [22] library function upper_bound, which returns the location in the sorted list of the point before which the new point should be inserted to maintain a sorted order. The point is inserted, and the new pairs created (the new point and the objects before and after it) must have certificates created for them and events scheduled. The rebuild_certificate function is called to handle updating the certificates. The rebuild_certificate function will also deschedule any previous certificates when necessary. Note that this implementation assumes that new_object is only called at instants when there is a rational time topologically equivalent to the current root. The current_time_nt call made to the SIMULATOR will fail otherwise, i.e., when two events occur simultaneously (a degeneracy). The easiest way to handle this is to postpone insertion until a non-degenerate rational time exists or to only insert objects at rational times. We ignore that issue in the example since handling it is somewhat situation dependent. The rebuild_certificate function updates the certificate associated with a passed pair to make sure it is correct. It first checks if there is a previous event corresponding to the pair which needs to be descheduled, and if so requests that the SIMULATOR deschedule it. Then a SOLVER is requested from the SIMULATOR, passing in the CONSTRUCTEDFUNCTION created by the KINETICKERNEL's Less_x_2 predicate applied to the pair of objects
137
in question. Then an Event is created to exchange the two objects and scheduled in the SIMULATOR at for that time. Note that the certificate function may not have any roots after the current time. In that case, the solver will return RooT::infinity (this is a special value of the ROOT type representing +00). The SIMULATOR detects this and will not schedule the associated event, but will instead return a placeholder Event-key. The Event is in charge of alerting the Sort_kds that it needs to be updated when a particular certificate failure occurs. Typically event classes are very simple, effectively just storing a pointer to the kinetic data structure and an identifier for the combinatorial piece which needs to be updated in addition to the time when the update must occur. This certificate also stores a copy of the SOLVER for reasons which will be discussed in the next paragraph. In order to be handled by the SIMULATOR, the Event class must have the following methods • time() which returns the time at which the event occurs and • set_processed(bool) which is called with the value true when the event occurs. In addition, in order to ease debugging, it must be able to be output to an std::ostream. The swap method is the update method in the Sort_kds. When a pair of objects is swapped, three old pairs of points are destroyed and replaced by three new pairs. Calls to rebuiId-certificate handle the updating of the certificates between a point of the swapped pair and its outside neighbors in the list. The pair that has just been exchanged should be dealt with differently for optimal efficiency. The predicate function corresponding to the new ordering of the swapped pair is the negation of that for the old ordering (i.e. Xk(t)—Xj (t) as opposed to Xj(t) — £&(£)), and so has the same roots. As a result, the old SOLVER can be used to find the next root, saving a great deal of time. In addition, the event which is currently being processed does not need to be descheduled as it is deleted by the SIMULATOR. Notice that the update method does not make any reference to time. This is necessary to properly support degeneracies, since few or no exact calculations can be made without a topologically equivalent rational time, which might not exist. The new-object method is mostly used for initialization and so can be assumed to occur at a non-degenerate time, the same assumption is less easily made about an event. As described in Section 3.5, the SIMULATOR can periodically send out notifications that there is a rational time at which all the kinetic data structures are nondegenerate and can be easily verified. The validate_at method is called in response to such a notification. Validation consists of using the INSTANTANEOUSKERNEL to
138
check that each pair in the list is ordered correctly. The remaining two methods, change-object and delete_object are only necessary if the the kinetic data structure wishes to support dynamic trajectory changes and removals. These methods are called by the mot_listener_ helper when appropriate. That is all it takes to implement a kinetic data structure which is exact, supports dynamic insertions and deletions of objects, allows points to change motions on the fly, and allows a variety of solvers and motion types to be used without modifications. 5 Conclusions and future work Our framework does not provide a mechanism for exactly updating the motions of objects at event times, for example bouncing a ball when it collides with a wall. Providing this functionality efficiently is nontrivial since, in general, the time of an event, te, is a ROOT which is not a rational number. The trajectory after the bounce is a polynomial in t — te and hence will not have rational coefficients. One approach would be to represent the polynomial coefficients using a number type that represents real algebraic numbers (such as CORE Expr or an extended version of our ROOT type) and write solvers that handle this. While our solvers currently support this functionality (except for the CORE based one), it is extremely slow and the bit complexity of the coefficients will rapidly increase with the number of trajectory modifications. In many circumstances it is not necessary to know the new trajectory exactly, as long as the approximations preserve the continuity of the trajectory and do not violate any predicates. An alternative approach is then to find a polynomial with rational coefficients of some bounded bit complexity which is close to the exact new trajectory. Ensuring that the new trajectory does not violate any predicates can be slightly tricky, as can ensuring continuity. We have not worked out all the ramifications of this approach and whether it can be made fully general. A third alternative would be to allow fuzzy motions—motions represented by polynomials whose coefficients are refinable intervals, for example, whose accuracy will depend on how accurately we need to know the motion. A root of such a polynomial cannot be known exactly and indeed may not exist at all, complicating matters. How to consistently process such events to give a generally meaningful and approximately correct simulation needs to be explored. We are investigating extending filtering into more areas of the framework. For example, currently, the INSTANTANEOUSKERNEL must compute the static coordinates of the objects requested using an exact number type and then pass this exact representation to a static predicate. If the static predicate uses filtering, it will
then convert the exact representation into an interval Graphics (TOG), 15(3):223-248, 1996. ISSN 0730representation, and attempt to perform the predicate 0301. computation. In many cases this will be enough and the exact representation will never need to be used as is. A [11] GNU Scientific Library. URL http://www.gnu. org/software/gsl/. better alternative would be to initially generate an interval representation of the static objects and attempt the interval based predicate calculation. Only when that [12] Leonidas Guibas. Kinetic data structures: A state of the art report. In Proc. 3rd Workshop on fails, compute the exact representation. CGAL provides Algorithmic Foundations of Robotics, 1998. support for all the necessary operations. Acknowledgments This research was partly supported by the NSF grants ITR-0086013 and CCR020448.
[13] Leonidas Guibas and Menelaos Karavelas. Interval methods for kinetic simulations. In Proc. 15th Annual ACM Symposium on Computational Geometry, pages 255-264, 1999.
References [14] Leonidas Guibas and Daniel Russel. An empirical [1] Julien Basch, Leonidas Guibas, and John Hershcomparison of techniques for updating delaunay berger. Data structures for mobile data. In Proc. triangulations. Manuscript, 2004. 8th Annual ACM-SIAM Symposium on Discrete [15] Susan Hert, Michael Hoffman, Lutz Kettner, SylAlgorithms, pages 747-756, 1997. van Pion, and Michael Seel. An adaptable and [2] Herve Bronnimann, Christoph Burnikel, and Sylextensible geometry kernel. In Algorithm Engivain Pion. Interval arithmetic yields efficient dyneering: 5th International Workshop, WAE 2001, namic filters for computational geometry. Discrete pages 79-91. Springer-Verlag Heidelberg, 2001. Applied Mathematics, 109:25-47, 2001. [16] Menelaos I. Karavelas and loannis Z. Emiris. Root [3] Christoph Burnikel, Jochen Konemann, Kurt comparison techniques applied to computing the Mehlhorn, Stefan Naher, Stefan Schirra, and Chrisadditively weighted voronoi diagram. In Proc. 14th tian Uhrig. Exact geometric computation in ACM-SIAM Symposium on Discrete Algorithms, LEDA. In Proc. llth Annual ACM Symposium on pages 320-329, 2003. Computational Geometry, pages 18-19, 1995. [17] Bernard Mourrain, Michael Vrahatis, , and Jean[4] CGAL. URL http://www.cgal.org. Claude Yakoubsohn. On the complexity of isolating real roots and computing with certainty the [5] David Cheriton. CS249 Course Reader. 2003. URL topological degree. J. of Complexity, 18(2):612http://cs249.stanford.edu. 640, 2002. [6] Coin. URL http://www.coin3d.org. [18] Sylvain Pion. Interval arithmetic: an efficient implementation and an application to computational [7] loannis Z. Emiris and Elias P. Tsigaridas. Comgeometry. In Proc. of the Workshop on Applicaparison of fourth-degree algebraic numbers and aptions of Interval Analysis to Systems and Control plications to geometric predicates. Technical Rewith special emphasis on recent advances in Modal port ECG-TR-302206-03, INRIA Sophia-Antipolis, Interval Analysis MISC'99, pages 99-109, 1999. 2003. [8] loannis Z. Emiris and Elias P. Tsigaridas. Meth- [19] Qt application development framework. URL ods to compare real roots of polynomials of small http://www.trolltech.com/products/qt. degree. Technical Report ECG-TR-242200-01, IN[20] Fabrice Rouillier and Paul Zimmerman. Efficient RIA Sophia-Antipolis, 2003. isolation of a polynomial real roots. Technical [9] Andreas Fabri, Geert-Jan Giezeman, Lutz Kettner, Report RR-4113, INRIA, February 2001. Stefan Schirra, and Sven Schonherr. On the design of CGAL a computational geometry algorithms [21] Jonathan Shewchuk. Adaptive Precision Floatingpoint Arithmetic and Fast Robust Geometric library. Software- Practice and Experience, 30(11): Predicates. Discrete & Computational Geometry, 1167-1202, 2000. 18(3):305-363, October 1997. [10] Steven Fortune and Christopher Van Wyk. Static analysis yields efficient exact integer arithmetic for [22] Standard Template Library. URL http://www. computational geometry. ACM Transactions on sgi.com/tech/stl/.
139
[23] R. Clint Whaley, Antoine Petitet, and Jack Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(l-2):3-35, 2001. URL http://math-atlas, sourceforge.net/. [24] Chee Yap. A new number core for robust numerical and geometric libraries. In Proc. 3rd CGC Workshop on Geometric Computing, 1998. URL http://www.cs.nyu.edu/exact/core/. [25] Chee Yap. Fundamental problems of algorithmic algebra. Oxford, 2000.
140
template class Swap.event; // The template arguments are the KineticKemel, the Simulator //and the MovingObjectTable. template class Sort_kds: //for ref counted pointers public CGAL::Ref_counted_base<Sort_kds< KK, Sim, MOT> > ( typedef Sort_kdsThis: // The way the Simulator represents time. typedef typename Simr.Time Time; //A label for a moving primitive in the MovingObjectTable typedef typename MOT::Key Object_key; //A label for a certificate so it can be de scheduled. typedef typename Sim::Event_key Event_key; // To shorten the names. Use the default choice for the static kernel. typedef typename CGAL::KDS:: CartesianJnstanlaneousJcemel<MOT> InstantaneousJcemel ; // this is used to identify pairs of objects in the list typedef typename std::list::iterator iterator, typedef Swap_event«^rime,This,iterator,typename Sim::Solver> Event; //Redirects the Simulator notifications to function calls typedef typename CGAL::KDS.: Simulator JcdsJistener Simjistener; //Redirects the MovingObjectTable notifications to function calls typedef typename CGAL::KDS:: Moving_object_table_listener_helper MOTJistener; public typedef CGAL::Ref_counted_pointer Pointer, //Register this KDS with the MovingObjectTable and the Simulator Sort_kds(typename Sim-Pointer sim, typename MOT::Pointer mot, const KK &kk=KK()): simjistenerjsim, this), motjistener_(mot, this), kemeL(kk), kerneLL(mot) { ) /* Insert k and update the affected certificates, std:: upper Jbound returns the first place where an item can be inserted in a sorted list. Called by the MOTJistener.*/ void new_object(Object_key k) { kemel_i_.set_time(simulator()->current_time_nt()); iterator it = std::upper_bound(sorted_.begin(), sorted_.end(), k,kernel_i_.less_x_2_object()); sorted_.insert(it, k); rebuild_certificate("it); rebuild_certificate(--it); /* Rebuild the certificate for the pair of points *it and *(++it). If there is a previous certificate there, deschedule it. */ void rebuild_certificate( const iterator it) { if (it == sorted_.end()) return; if (events_.find(*it) != events_.end()) { simulatorO->delete_event(events_[*it]); events_.erase(*it); } if (next(it)== sorted_.end()) return; typename KK::Less_x_2 less=kemel_.less_x_2_object(); typename Sim::Solver s = simulator()->solver_object(less(object(*(it)), object(*next(it)))); Time ft= s.next_time_negative(); // the Simulator will detect if the failure time is at infinity events_[*it]= simulator()->new_event(Event(ft, it, Pointer(this),s)); /* Swap the pair of objects with *it as the first element. The old solver is used to compute the next root between the two points being swapped. This method is called by an Event object. */ void swapOterator it, typename Sim::Solver &s) { events_.erase(*it);
simulator()-xlelete_event(events_[*next(it)]); events_.erase(*next(it)); std::swap(*it, *next(it)); rebuild_certificate(next(it)); Time ft= s.next_time_negative(); events_[*it]= simulator()->new_event(Event(ft, it, this.s)): rebuild_certificate(~it); /* Verify the structure by checking that the current coordinates are properly sorted for time t. This function is called by the Simjistener. * void validate_at( typename Sim:: NT t) const { kernel_i_.set_time(t); typename Instantaneous_kemel::Less_x_2 less= kemel_i_.less_x_2_o for (typename std::list::const_iterator it = sorted_.begin(); *it != sorted_.back(); assert( !less(*it,*next(it)));
/* Update the certificates adjacent to object k. This method is called by the MOT_listener. std: :equal_range finds all items equal to a key in a sorted list (there can only be one). */ void change_object(Object_key k) { iterator it = std::equal_range(sorted_.begin(), sorted_.end().k).first; rebuild_certificate(it);rebuild_ceitificate(— it); /* Remove object k and destroy 2 certificates and create one new one. This function is called by the MOTJistener. */ void delete_object(0bject_key k) { iterator it = std::equal_range(soited_.begin(), sorted_.end(),k).first; sorted_.erase(it-); rebuild_certificate(it); simulator()-:>delete_event(events_[*it]); events_.erase(*it); } template static It next(It it){ return ++it; } typename MOT::Object object(Object_key k) const { return mot_listener_.notifier()->object(k); } Sim* simulatorO {return sim_listener_.notifier(); } Simjistener sim_listener_; MOTJistener motjistener_; // The points in sorted order std::list sorted_; // eventsJk] is the certificates between k and the object after it std::map events_; KK kerneL; InstantaneousJcemel kernel J_; /* It needs to implement the time() and processf) functions and operator« */ template swap(left_object_, s_); } const Time &time() const {return t_;} Id left_object_; typename Sort-Pointer sorter_; Solver s_; Time t_; >; template std::ostream &operator«(std::ostream &out, const Swap_event &ev)( return out« "swap " « *ev.left_object_ « " at " « ev.t_;
Figure 2: A simple kinetic data structure: it maintains a list of points sorted by their x coordinate. The code is complete and works as printed. Insertions and deletions are in linear time due to the lack of an exposed binary tree class in STL or CGAL. Support for graphical display is skipped due to lack of space.
141
Engineering a Sorted List Data Structure for 32 Bit Keys* Roman Dementiev*
Lutz Kettner t
Abstract Search tree data structures like van Emde Boas (vEB) trees are a theoretically attractive alternative to comparison based search trees because they have better asymptotic performance for small integer keys and large inputs. This paper studies their practicability using 32 bit keys as an example. While direct implementations of vEB-trees cannot compete with good implementations of comparison based data structures, our tuned data structure significantly outperforms comparison based implementations for searching and shows at least comparable performance for insertion and deletion. 1 Introduction Sorted lists with an auxiliary data structure that supports fast searching, insertion, and deletion are one of the most versatile data structures. In current algorithm libraries [11, 2], they are implemented using comparison based data structures such as a6-trees, red-black trees, splay trees, or skip lists (e.g. [11])- These implementations support insertion, deletion, and search in time 0(logn) and range queries in time 0(fc + logn) where n is the number of elements and k is the size of the output. For w bit integer keys, a theoretically attractive alternative are van Emde Boas stratified trees (vEB-trees) that replace the logn by a logw [14, 10]: A vEB tree T for storing subsets M of w = 2k+l bit integers stores the set directly if \M\ = 1. Otherwise it contains a root (hash) table r such that r[i] points to a vEB tree Tt for 2fc bit integers. T» represents the set Mi = {x mod 22>° : x € M A x » 2fc = z}. 1 Furthermore, T stores minM, maxM, and a top data structure t consisting of a 2fc bit vEB tree storing the set M, - {x » 2k : x € M}. This data structure takes space 0(| A/I log w) and can be modified to consume only linear space. It can also be combined with a doubly 'Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT). tMPI liiformatik, Stuhlsatzenhausweg 85, 66123 Saarbriicken, Germany, [dementlev,kettner,jmehnert,sanders]0mpi-sb. mpg.de 1 We use the C-like shift operator '»', i.e., x » i = [.x/2'j.
142
Jens Mehnertt
Peter Sanders*
linked sorted list to support fast successor and predecessor queries. However, we are only aware of a single implementation study [15] where the conclusion is that vEB-trees are of mainly theoretical interest. In fact, our experiments show that they are slower than comparison based implementations even for 32 bit keys. In this paper we address the question whether implementations that exploit integer keys can be a practical alternative to comparison based implementations. In Section 2, we develop a highly tuned data structure for large sorted lists with 32 bit keys. The starting point were vEB search trees as described in [10] but we arrive at a nonrecursive data structure: We get a three level search tree. The root is represented by an array of size 216 and the lower levels use hash tables of size up to 256. Due to this small size, hash functions can be implemented by table lookup. Locating entries in these tables is achieved using hierarchies of bit patterns similar to the integer priority queue described in [1]. Experiments described in Section 3 indicate that this data structure is significantly faster in searching elements than comparison based implementations. For insertion and deletion the two alternatives have comparable speed. Section 4 discusses additional issues. More Related Work: There are studies on exploiting integer keys in more restricted data structures. In particular, sorting has been studied extensively (refer to [13, 7] for a recent overview). Other variants are priority queues (e.g. [1]), or data structures supporting fast search in static data [6]. Dictionaries can be implemented very efficiently using hash tables. However, none of these data structures is applicable if we have to maintain a sorted list dynamically. Simple examples are sweep-line algorithms [3] for orthogonal objects,2 best first heuristics (e.g.. [8]), or finding free slots in a list of occupied intervals (e.g. [4]). "General line segments are a nice example where a comparison based data structure is needed (at least for the Bentley-Ottmann algorithm) — the actual coordinates of the search tree entries change as the sweep line progresses but the relative order changes only slowly.
2
The Data Structure
We now describe a data structure Stree that stores an ordered set of elements M with 32-bit integer keys supporting the main operations element insertion, element deletion, and locate(y). Locate returns min(x (E M : y<x). We use the following notation: For an integer x, x[i] represents the i-th bit, i.e., x = ^%LQ2lx[i]. x[i..j], i < j +1, denotes bits i through j in a binary representation of x - a;[0..31], i.e., x[i..j] = El^*-^]. Note that x[i..i — 1] = 0 represents the empty bit string. The function msbPos(z) returns the position of the most significant nonzero bit in z, i.e., msbPos(2) = [Iog2 z\ = max{i : x[i] ^ O}.3 Our Stree stores elements in a doubly linked sorted element list and additionally builds a stratified tree data structure that serves as an index for fast access to the elements of the list. If locate actually returns a pointer to the element list, additional operations like successor, predecessor, or range queries can also be efficiently implemented. The index data structure consists of the following ingredients arranged in three levels, root, Level 2 (L2), and Level 3 (L3): The root-table r contains a plain array with one entry for each possible value of the 16 most significant bits of the keys. r[i] = null if there is no x G M with x[16..31] = i. If \Mi\ = 1, it contains a pointer to the element list item corresponding to the unique element of M.J. Otherwise, r[i] points to an L2-table containing Mi = {x € M : z[16..31] = i}. The two latter cases can be distinguished using a flag stored in the least significant bit of the pointer.4 An L2-table TI stores the elements in Mi. If |Mi| > 2 it uses a hash table storing an entry with key j if 3x € Mi : z[8..15] —j. Let Mij = {x e M : x[8..15] = j,x[16..31] = i}. If | My | = 1 the hash table entry points to the element list and if |My| > 2 it points to an L3-table representing Mij using a similar trick as in the root-table.
Minima and Maxima: For the root and each L2table and L3-table, we store the smallest and largest element of the corresponding subset of M. We store both the key of the element and a pointer to the element list. The root-top data structure t consists of three bitarrays ^[O.^16 - 1], £2[0..4095], and <3[0..63]. We have tl[i] = 1 iff M, ^ 0. i2[j] is the logicalor of tl[32j]..tl[32j + 31], i.e., t2\j] = 1 iff Bt € {32J..32J + 31} : Mi ^ 0. Similarly, t3[k] is the logical-or of £2[32fc]..£2[32fc + 31] so that t3[k] = 1 iff 3i e {1024A:..1024A: 4- 1023} : Mi ^ 0. The L2-top data structures ti consists of two bit arrays £*[0..255] and i.2[0..7] similar to the bit arrays of the root-top data structure. The 256 bit table t\ contains a 1-bit for each nonempty entry of Ti and the eight bits in t 2 contain the logical-or of 32 bits in t\. This data structure is only allocated if |Mi| > 2. The L3-top data structures ty with bit arrays £y[0..255] and t?,-[0..7] reflect the entries of My in a fashion analogous to the L2-top data structure. Hash Tables use open addressing with linear probing [9, Section 6.4]. The table size is always a power of two between 4 and 256. The size is doubled when a table of size k contains more than 3fc/4 entries and k < 256. The table shrinks when it contains less than k/4 entries. Since all keys are between 0 and 255, we can afford to implement the hash function as a full lookup table h that is shared between all tables. This lookup table is initialized to a random permutation h : 0..255 —> 0..255. Hash function values for a table of size 256/2* are obtained by shifting h[x] i bits to the right. Note that for tables of size 256 we obtain a perfect hash function, i.e., there are no collisions between different table entries. Figure 1 gives an example summarizing the data structure.
2.1 Operations: With the data structure in place, the operations are simple in principle although some case distinctions are needed. To give an example, Figure 2 contains high level pseudo code for locate(y) that finds the smallest x € M with y < x. locate(y) first uses the 16 most significant bits of y, say i = y[l6..31] to find a pointer to Mi in the root table. If Mi is empty (/-[i] = null), or if the precomputed 3 msbPos can be implemented in constant time by converting maximum of M,; is smaller than y, locate looks for the number to floating point and then inspecting the exponent. In our implementation, two 16-bit table lookups turn out to be the next nonzero bit i' in the root-top data structure and returns the smallest element of Mi/. Otherwise, the somewhat faster. 4 This is portable without further measures because all modern next element must be in Mi. Now, j = y[8..15] serves systems use addresses that are multiples of four (except for as the key into the hash table TJ stored with Mj and the An L3-table ry stores the elements in My. If I My | > 2, it uses a hash table storing an entry with key k if 3x e My : x[0..7] = k. This entry points to an item in the element list storing the element with z[0..7] = Jfe,x[8..15] = j,x[16..31] = i.
strings).
143
Figure 1: The Stree-data structure for M = (1,11, 111, 1111,111111} (decimal). (* return handle of minx € M : y < x *) Function locate(y : N) : ElementHandle if y > max M then return oo i := y[16.31] if r[i] = null or y > max Mi then return min Mti .iocate(i) if Mi = {x} then return x
j := y[8..15] if n[j] = null or y > max My then return min Mijt\ .locate^) if Mij ~ {x} then return x
return ri;,-[t^.locate(y[0..7])]
/ / n o larger element // index into root table r // single element case // key for L2 hash table at M» // single element case // L3 Hash table access
(* find the smallest j >i such that tk\j] = 1 *) Method locate(i) for a bit array tk consisting of n bit words (* n = 32 for t1, t2, t|, t\- n = 64 for t3; n = 8 for t?, *£. *) (* Assertion: some bit in tk to the right of i is nonzero *) j :— i div n // which n bit word in b contains bit i? a := tk[nj..nj + n - 1] // get this word set a[(i mod n) -+- l..n — 1] to zero // erase the bits to the left of bit i if a = 0 then // nothing here —> look in higher level bit array j := £fc+1.locate(j) // tk+l stores the or of n-bit groups of tk k a := t [nj..nj + n - 1] // get the corresponding word in tk return nj + msbPos(a) Figure 2: Pseudo code for locating the smallest a; € M with y < x.
144
pattern from level one repeats on level two and possibly on level 3. locate in a hierarchy of bit patterns walks up the hierarchy until a "nearby" nonzero bit position is found and then goes down the hierarchy to find the exact position. We now outline the implementation of the remaining operations. A detailed source code is available at http://www.mpi-sb.mpg.de/~kettner/proj/veb/.
list can accommodate several elements. A similar more problem specific approach is to store up to K elements in the L2-tables and L3-tables without allocating hash tables and top data structures. The main drawback of this approach is that it leads to tedious case distinctions in the implementation. An interesting measure is to completely omit the element list and to replace all the L3 hash tables by a single unified hash table. This not only saves space, but also allows a fast direct acfind(x) descends the tree until the list item correspondcess to elements whose keys are known. However range ing to x is found. If x £ M a null pointer is returned. queries get slower and we need hash functions for full No access to the top data structures is needed. 32 bit keys. insert(o;) proceeds similar to locate(x) except that Multi-sets can be stored by associating a singly linked it modifies the data structures it traverses: Minima and list of elements with identical key with each item of the maxima are updated and the appropriate bits in the element list. top data structure are set. At the end, a pointer to the element list item of x's successor is available so that x Other Key Lengths: We can further simplify and can be inserted in front of it. When an Mi or MIJ grows speed up our data structure for smaller key lengths. For to two elements, a new L2/L3-table with two elements 8 and 16 bit keys we would only need the root table and its associated top data structure which would be very is allocated. fast. For 24 bit keys we could at least save the third del(x) performs a downward pass analogous to find(rc) level. We could go from 32 bits to 36-38 bits without and updates the data structure in an upward pass: Minmuch higher costs on a 64 bit machine. The root table ima and maxima are updated. The list item correspondcould distinguish between the 18 most significant bits ing to x is removed. When an L2/L3-table shrinks to and the L2 and L3 tables could also be enlarged at some a single element, the corresponding hash table and top space penalty. However, the step to 64 bit keys could be data structure are deallocated. When an element/L3quite costly. The root-table can no longer be an array; table/L2-table is deallocated, the top-data structure the root top data structure becomes as complex as a 32 above it is updated by erasing the bit corresponding bit data structure; hash functions at level two become to the deallocated entry; when this leaves a zero 32 bit more expensive. word, a bit in the next higher level of bits is erased etc. Floating Point Keys can be implemented very easily 2.2 Variants: The data structure allows several in- by exploiting that IEEE floats keep their relative order teresting variants: when interpreted as integers. Saving Space: Our Stree data structure can consume considerably more space than comparison based search trees. This is particularly severe if many trees with small average number of elements are needed. For such applications, the 256 KByte for the root array r could be replaced by a hash table with a significant but "nonfatal" impact on speed. The worst case for all input sizes is if there are pairs of elements that only differ in the 8 least significant bits and differ from all other elements in the 16 most significant bits. In this case, hash tables and top data structures at levels two and three are allocated for each such pair of elements. The standard trick to remedy this problem is to store most elements only in the element list. 
The locate operation then first accesses the index data structure and then scans the element list until the right element is found. The drawback of this is that scanning a linked list can cause many cache faults. But perhaps one could develop a data structure where each item of the element
3 Experiments We now compare several implementations of search tree like data structures. As comparison based data structures we use the STL map which is based on red-black trees and ab.tree from LEDA which is based on (a, &)trees with a = 2, b = 16 which fared best in a previous comparison of search tree data structures in LEDA [12].5 We present three implementations of integer data structures. orig-Stree is a direct C++ implementation of the algorithm described in [10], LEDA-Stree is an implementation of the same algorithm available in LEDA [15], and Stree is our tuned implementation. orig-Stree and LEDA-Stree store sets of integers rather than sorted lists but this should only make them faster than the other implementations. 5 To use (2,16)-trees in LEDA you can declare a jsortseq with implementation parameter ab.tree. The default implementation for sortseq based on skip lists is much slower in our experiments.
145
Figure 3: Locate operations for random keys that are drawn independently from M.
Figure 4: Constructing a tree using n insertions of random elements.
146
Figure 5: Deleting n random elements in the order in which they were inserted.
Figure 6: Locate operations for hard inputs.
147
The implementations run under Linux on a 2GHz Intel Xeon processor with 512 KByte of L2-cache using an Intel E7500 Chip set. The machine has iGByte of RAM and no swap space to exclude swapping effects. We use the g++ 2.95.4 compiler with optimization level -06. We report the average execution time per operation in nanoseconds on an otherwise unloaded machine. The average is taken over at least 100 000 executions of the operation. Elements are 32 bit unsigned integers plus a 32 bit integer as associated information. Figure 3 shows the time for the locate operation for random 32 bit integers and independently drawn random 32 bit queries for locate. Already the comparison based data structures show some interesting effects. For small n, when the data structures fit in cache, red-black trees outperform (2,16)-trees indicating that red-black trees execute less instructions. For larger n this picture changes dramatically, presumably because (2,16)-trees are more cache efficient. Our Stree is fastest over the entire range of inputs. For small n, it is much faster than comparison based structures up to a factor of 4.1. For random inputs of this size, locate mostly accesses the root-top data structure which fits in cache and hence is very fast. It even gets faster with increasing n because then locate rarely has to go to the second or even third level t2 and t3 of the root-top data structure. For medium size inputs there is a range of steep increase of execution time because the L2 and L3 data structures get used more heavily and the memory consumption quickly exceeds the cache size. But the speedup over (2,16)trees is always at least 1.5. For large n the advantage over comparison based data structures is growing again reaching a factor of 2.9 for the largest inputs. The previous implementations of integer data structures reverse this picture. They are always slower than (2,16)-trees and very much so for small n.6 We tried the codes until we ran out of memory to give some indication of the memory consumption. Previous implementations only reach 218 elements. At least for random inputs, our data structure is not more space consuming than (2.16)-trees.7 Figures 4-5 show the running times for insertions and deletions of random elements. Stree outperforms (2,16)-trees in most cases but the differences are never very big. The previovis implementations of integer b For the LEDA implementation one obvious practical improvement is to replace dynamic perfect hashing by a simpler hash table data structure. We tried that using hashing with chaining. This brings some improvement but remains slower than (2,16)-trees. 7 For hard inputs, Stree and (2,16)-trees are at a significant disadvantage compared to red-black trees.
148
data structures and, for large n, red-black trees are significantly slower than Stree and (2,16)-trees. The dominating factor here is memory management overhead. In fact, our first versions of Stree had big problems with memory management for large n. We tried the default new and delete, the g++ STL allocator, and the LEDA memory manager. We got the best performance with with a reconfigured LEDA memory manager that only calls malloc for chunks of size above 1024 byte and that is also used for allocating the hash table arrays8. The g+4- STL allocator also performed quite well. We have not measured the time for a plain lookup because all the data structures could implement this more efficiently by storing an additional hash table. Figures 6 shows the result for an attempt to obtain close to worst case inputs for Stree. For a given set size \M\ — n, we store Mhard = {28iA, 28iA + 255 : i = 0..n/2 - 1} where A = [225/nJ. Mhard maximizes space consumption "of our implementation. Furthermore, locate queries of the form 28jf A + 128 for random j e 0..n/2 - 1 force Stree to go through the root table, the L2-table, both levels of the L3-top data structure, and the L3-table. As to be expected, the comparison based implementations are not affected by this change of input. For n < 218, Stree is now slower than its comparison based competitors. However, for large n we still have a similar speedup as for random inputs.
4
Discussion
We have demonstrated that search tree data structures exploiting numeric keys can outperform comparison based data structures. A number of possible questions remain. For example, we have not put particular emphasis on space efficient implementation. Some optimizations should be possible at the cost of code complexity but with no negative influence on speed. An interesting test would be to embed the data structure into other algorithms and explore how much speedup can be obtained. However, although search trees are a performance bottleneck in several important applications that have also been intensively studied experimentally (e.g. the best first heuristics for bin packing [5]), we are not aware of real inputs used in any of these studies.9 8 By default chunks of size bigger than 256 bytes and all arrays are allocated with malloc. 9 Many inputs are available for dictionary data structure from the 1996 DIMACS implementation challenge. However, they all affect only find operations rather than locate operations. Without the need to locate, a hash table would always be fastest.
Acknowledgments: We would like to thank Kurt Mehlhorn and Stefan Naher for valuable suggestions.
References [1] A. Andersson and M. Thorup. A pragmatic implementation of monotone priority queues. In DIMACS'96 implementation challenge, 1996. [2] M. H. Austern. Generic programming and the STL : using and extending the C++ standard template library. Addison-Wesley, 7 edition, 2001. [3] J. L. Bentley and T. A. Ottmann. Algorithms for reporting and counting geometric intersections. IEEE Transactions on Computers, pages 643-647, 1979. [4] P. Herman and B. DasGupta. Multi-phase algorithms for throughput maximization for real-time scheduling. Journal of Combinatorial Optimization, 4(3):307-323, 2000. [5] E. G. Coffman, M. R. Garey Jr., , and D. S. Johnson. Approximation algorithms for bin packing: A survey. In D. Hochbaum, editor, Approximation Algorithms for NP-Hard Problems, pages 46-93. PWS, 1997. [6] P. Crescenzi, L. Dardini, and R. Grossi. IP address lookup made fast and simple. In Euopean Symposium on Algorithms, pages 65-76, 1999. [7] D. J. Gonzalez, J. Larriba-Pey, and J. J. Navarro and. Algorithms for Memory Hierarchies, volume 2625 of LNCS, chapter Case Study: Memory Conscious Parallel Sorting, pages 171-192. Springer, 2003. [8] D. S. Johnson. Fast algorithms for bin packing. Journal of Computer and System Sciences, 8:272-314, 1974. [9] D. E. Knuth. The Art of Computer Programming — Sorting and Searching, volume 3. Addison Wesley, 2nd edition, 1998. [10] K. Mehlhorn and S. Naher. Bounded ordered dictionaries in O(log log N) time and O(n) space. Information Processing Letters, 35(4): 183-189, 1990. [11] K. Mehlhorn and S. Naher. The LEDA Platform of Combinatorial and Geometric Computing. Cambridge University Press, 1999. [12] S. Naher. Comparison of search-tree data structures in LEDA. personal communication. [13] N. Rahman. Algorithms for Memory Hierarchies, volume 2625 of LNCS, chapter Algorithms for Hardware Caches and TLB, pages 171-192. Springer, 2003. [14] P. van Emde Boas. Preserving order in a forest in less than logarithmic time. Information Processing Letters, 6(3):80-82, 1977. [15] M. Wenzel. Worterbiicher fur ein beschranktes Universum (dictionaries for a bounded universe). Master's thesis, Saarland University, Germany, 1992.
149
This page intentionally left blank
Workshop on Analytic Algorithmics and Combinatorics
Invited Plenary Speaker Abstract Theory and Practice of Probabilistic Counting Algorithms Philippe Flajolet, INRIA, France Many computer applications require finding quantitative characteristics of very large data sets. The situation occurs, for instance, in data mining of text-based data as well as in the detection of anomalies in router activities. Randomization can lead to surprisingly efficient algorithms: for instance, using a single pass over the data and an auxiliary storage of two kilobytes, one can determine the number of distinct elements in a massive data set within an accuracy typically better than 2\%. We shall review some of these algorithms, including "Probabilistic Counting" of Flajolet-Martin and the more recent Loglog Counting algorithm of Durand-Flajolet. The very design of these algorithms is based on a thorough analysis of associated probability distributions by methods of analytic combinatorics, which then provides the right dimensionings, the exact constants making the algorithms unbiased, and the correct probabilistic estimates of a risk of anomalous behavior.
152
Analysis of a Randomized Selection Algorithm Motivated by the LZ'77 Scheme Mark Daniel Ward Department of Mathematics Purdue University West Lafayette, IN 47907-2067 mwardQmath. purdue. edu Abstract We consider a randomized selection algorithm that has n initial participants and a moderator. In each round of the process, each participant and the moderator throw a biased coin. Only the participants who throw the same result as the moderator stay in the game for subsequent rounds. With probability 1, all participants are eliminated in finitely many rounds. We let Mn denote the number of participants remaining in the game in the last nontrivial round. This simple algorithm has surprisingly many interesting applications. In particular, it models (asymptotically) the number of longest prefixes in the Lempel-Ziv '77 data compression scheme. Such multiplicity was used recently in [13] to design an error-resilient LZ'77 scheme. We give precise asymptotic characteristics of the jth factorial moment of Mn for all j € N. Also, we present a detailed asymptotic description of the exponential generating function for Mn. In particular, we exhibit periodic fluctuation in the distribution of Mn, and we prove that no limiting distribution exists (however, we observe that the asymptotic distribution follows the logarithmic series distribution plus some fluctuations). The results we develop are proved by probabilistic and analytical techniques of the analysis of algorithms. In particular, we utilize recurrence relations, analytical poissonization and depoissonization, the Mellin transform, and complex analysis.
1 Introduction. We consider a randomized selection algorithm that has n initial participants and a moderator. At the outset, n participants and one moderator are present. Each has a biased coin with probability p of showing heads when flipped, and we write q = 1 — p. At each stage of the selection process, the moderator flips its coin once; participants remain for subsequent rounds if and only if their result agrees with the moderator's result. Note that all participants are eliminated in finitely many rounds with probability 1. We let Mn denote the number of participants remaining in the last nontrivial round (i.e., the final round in which some participants still remain). Equivalent descriptions of the algorithm are given in the next section. Briefly we explain the algorithm in terms of tries. Consider a trie built from strings of O's and 1's drawn "The research of this author was supported by NSF Grant CCR-0208709, and by NIH Grant R01 GM068959-01.
Wojciech Szpankowski* Department of Computer Science Purdue University West Lafayette, IN 47907-2066
spaOcs.purdue.edu from an i.i.d. source. We restrict attention to the situation where n such strings have already been inserted into a trie. When the (n + l)-st string is inserted into the trie, Mn denotes the size of the subtree that starts at the insertion point of this new string. The results of our analysis of Mn yield information about the redundancy in the LZ'77 algorithm [18]. In LZ'77, for a given training sequence .Xi,...,X n , the next phrase is the longest prefix of the uncompressed sequence X n+ i,X n+ 2,... that occurs at least once in the training sequence Xi,..., Xn. Such a phrase can be found by building a suffix tree from the training sequence and inserting the (n + l)-st suffix into the tree. The depth of insertion is the length of the next LZ'77 phrase and the size of the subtree starting at the insertion point represents the number of potential phrases (i.e., any phrase can be chosen for encoding). The latter quantity is asymptotically equivalent to our Mn (constructed for independent tries) with an error bound of O(logn/ra) (cf. [9, 16]). Finally, we observe that multiplicity of LZ'77 phrases is used in Lonardi and Szpankowski [13] to design an error resilient LZ'77 scheme called LZRS'77 (the "RS" denotes ReedSolomon error-correcting coding). Thus precise analysis of Mn allows us to obtain detailed information about the redundancy of LZ'77 and its error resilient version LZRS'77. Related problems have been studied. For instance, suppose the moderator is replaced; instead, participants remain in the selection process if and only if they throw heads. Also, if Mn ^ 1, then the selection process is deemed inconclusive and the entire selection process is repeated. Finally, Hn denotes the number of rounds until the selection process determines a conclusive "leader." Prodinger [14] first posed this problem and made a non-trivial analysis, but he considered fair (unbiased) coins. Then Fill et. al. [3] found the limiting distribution of the number of rounds, but they also utilized fair coins. Recently, Janson and Szpankowski [11] gave precise asymptotic information about E[Hn],
153
Var[.ffn], and the distribution of #„; we note that the analysis in [11] dealt with biased coins. A wealth of results have been published that are pertinent to the methodology developed below (see especially [4] and [17]). We strongly emphasize that these methods are widely applicable to a great variety of other problems. The precise asymptotic descriptions of the distribution of Mn and the factorial moments of Mn should entice others to continue utilizing such methods in studying related problems. We establish the asymptotic distribution of Mn and the factorial moments of Mn. Note that a first order asymptotic solution for the distribution and the factorial moments in a cone about the positive x axis is not too difficult to obtain, but a second order asymptotic solution is relatively much more difficult to derive. Our method is to first poissonize the problem. In other words, we no longer require n to be fixed, but instead we let the number of initial participants in the selection process be a random variable that is Poisson distributed with mean n. Then we utilize the Mellin transform and analytic methods to obtain asymptotic solutions in the Poisson model. Finally, we depoissonize the results to obtain the asymptotic distribution and factorial moments of Mn, both accurate to second order. Interestingly, when Inp/ In q is rational, we note that the asymptotic distribution and factorial moments of Mn exhibit fluctuations. Therefore Mn does not have a limiting distribution or limiting factorial moments, but we provide precise formulas for both quantities. In particular, we prove that the asymptotic distribution of Mn follows the logarithmic series distribution (plus some fluctuations), that is, P(Mn = j) * (lM)(p*(l -p) + (l-p)jp)/j where h is the entropy rate. Our results are organized in the following way: In the next section, two theorems are given. Theorem 1 provides a precise asymptotic description of the distribution of Mn. Then Theorem 2 gives analogous results for the factorial moments of Mn. Both theorems contain results which are second order accurate. Then we briefly discuss the consequences of the two theorems. In particular, we elaborate on the fluctuations mentioned above. In the third section, we prove both theorems. As we just mentioned briefly, our methodology uses poissonization. So we utilize a depoissonization lemma of Jacquet and Szpankowski (see [10] and [17]). 2 Main Results. We first give a mathematically rigorous formulation of the problem to be analyzed. Let p be fixed with 0 < p < 1, and write q = 1 — p. Define X(j) to be the string X^X^X^..., where {X^ \ i,j € N} is a collection of i.i.d. random variables on {0,1}, with
154
. In other words, when comparing the jih and (n 4- l)st strings, let lj denote the length of the longest common prefix of these two strings. Then define Ln = maxj< n ^ n . In other words, among the first n strings, let Ln denote the length of the longest common prefix with the (n-f-l)-st string. Finally, define Mn = #{j | 1 < j < n, 4n) = Ln}. So Mn is the number of the first n strings that have a common prefix of length Ln with the (n + l)-st string. By convention, let MQ = 0. In passing we observe that if the strings X(l),..., X(n) are suffixes of a single string, then our Mn is asymptotically equivalent to the multiplicity of phrases in the LZ'77 scheme. Now we present the problem from the viewpoint of tries. The alignment Cjlt...,jh among k strings X(ji),...,X(jk) is the length of the longest common prefix of the k strings. We observe that l^ = Cjtn+iThe fcth depth Dn+i(k) in a trie built over n +1 strings is the length of the path from the root of the trie to the leaf containing the fcth string. Note Dn+i(n 4-1) = max Cj.n+i 4- 1. Therefore Ln — Dn+i(n + 1) — 1. l<jn+i 4-1 = £>n+i(n 4-1)}. That is, Mn is the size of a subtree starting at the branching point of a new insertion. Define the exponential generating functions
for complex u € C and j e N. If / : C -> C, then the recurrence relation
holds for all n € N. If /(O) = 0, then the recurrence also holds when n = 0. To verify (2.1), just consider the possible values of X^' for 1 < j < n + 1. Two useful facts follow immediately from this recurrence relation. First, if n 6 N, then
Also, if j 6 N and n > 0 then
i
We derive an asymptotic solution for these recurrence relations using poissonization, the Mellin transform, . , .1 . 1 and depoissomzation; details are given in the next section. These methods yield the following two theorems. THEOREM 2.1. Let zk = ^f VJb e Z, where * = J /or some relatively prime r, s € Z (recall that we are interested in the situation where jj is rational). Then
THEOREM 2.2. Let for some relatively prime r, s
where
\
and F is the Euler gamma function. It follows immediately that
where Sj(t) =
and F is the Euler gamma function.
and
Note that the term O(n~l). Note that <Jj is a periodic function that has small magnitude and exhibits fluctuation. For instance, when p = 1/2 then The approximate values of given below for the first ten values of j.
J 1 2 3 4 5 6 7 8 9 10
1.4260 xlO~ 5 1.3005 x!0~ 4 1.2072 xlO~ 3 1.1527 xlO- 2 1.1421 x 10-1 1.1823 x!0° 1.2853 xlO 1 1.4721 xlO 2 1.7798 xlO 3 2.2737 xlO 4
(2-6)
Note that S is a periodic function that has small magnitude and exhibits fluctuation. For instance, when 3.1463 x 10~6. The non-fluctuating part of the distribution of P(Mn = j) follows the logarithmic series distribution, as already mentioned above. If In p/ Inq is irrational and u is fixed, then we observe 5(x,u) —> 0 as x —> oo. Thus S does not exhibit fluctuation when In p/ In q is irrational. Remark: We emphasize that the same methodology can be used to obtain even more terms in the asymptotic formulae given in the two theorems.
3 Analysis and Proofs. We note that, if In p/ In q is irrational, then 8j(x) -* 0 Now we present our analytical approach for proving as x -f oo. So 6j does not exhibit fluctuation when Theorems 1 and 2. Our first strategy is to poissonize the In p/ Ing is irrational. problem. Then we utilize the Mellin transform and comThe next result describes the asymptotic distribu- plex analysis; thus we obtain asymptotic descriptions of tion of Mn. the distribution and factorial moments of Mn, but we
155
emphasize that these results axe valid for the poissonized 3.2 Mellin Transform. If / is a complex- valued model of the problem. We must depoissonize our results function which is continuous on (0, co) and is locally in order to find the asymptotic distribution and factorial integrable, then the Mellin transform of / is defined as moments of Mn in the original model. 3.1 Poissonization. We first utilize analytical poissonization. The idea is to replace thefixed-sizepop- (see [5] and page 400 of [17]). Three basic properties of ulation model (i.e., the model in which the number of the Mellin transform are useful in proving the next two initial participants n in the selection process is fixed) resuhs \ye observe that by a poissonized model in which the number of initial participants is a Poisson random variable with mean n. This is affectionately referred to as "poissonizing" the problem. So we let the number of initial participants in the selection process be N, a random variable that has Poisson distribution and mean n (i.e., P(N = j ) = e~nni/j\ Vj > 0). We apply the Poisson transform to the exponential generating functions If M > 0 we also notice G(z,u) and Wj(z), which yields:
By using (2.2) to expand the coefficients of zn in G(z, «) for n > 1, we observe C(z, «) = We first find the fundamental strip of we observe that
Similarly, we apply (2.3) to the coefficients of zn in
e observe that G(z,u) = G(z,u)e~z. If weWE multiply by e~z throughout (3.7) and then simplify, we
obtain
Similarly, from (3.8) we know that if j € N then
Note that the functional equations (3.9) and (3.10) for the poissonized versions of G(z,u) and Wj(z) are simpler than the corresponding equations (3.7) and (3.8) from the original (Bernoulli) model. We solve (3.9) and (3.10) asymptotically for large z € M. 156
We notice tha G x
l ( >u) ^ 1 as a: ^ 0, but we want to ^ G(x,u)j= O(x] as x -» 0. So we replace G(x,u) by writing G(ar,ti) = G(x,u) - I . We expect G(x,u) = O(l) = O(x°) as x -^ oo. Therefore the fundamental strip of G(x, u) includes (—1,0). stead have
We next determine the fundamental strip of Wj(x). 3.3 Results for the Poisson Model. We are reBy (3.10), we know stricting attention to the case where In p/ Ing is rational. Thus we can write In p/ In q = r/t for some relatively prime r, t e Z. Then, by a theorem of Jacquet and Schachinger (see page 356 of [17]), we know that the set of poles of W;(a)x— is exactly We also observe that W'.7*(s)a:~5 has simple poles at each Zk- Now we assume that u ^ 1. Then (7*(s,w)ar~* has the same set of poles as Wj(s)x~8, each of which is a simple pole. Let TI denote theline segmentfrom—|—iA to —|+ iAin the complex plane, where A is alarge real number. Let T^ denote the line segmentfrom—|+ iA to M+iA. Let TS denote the line segment from M + iA to M — iA. Let T4 denote the line segment from M — iA to - 1 -iA. Now we claim that, if j e N a then We expect So -j',0) is the fundamental strip of Wj(x). If u € M with u < min{l/p,l/g} and if Re(s) € ( -1,0) then it follows from (3.9) and the properties of Using the Cauchy residue theorem [1], integrating clockwise around the curve described by Ti,l2,T3,T4, we the Mellin transform given above that <7*(s,w) = have If j € N and Re(s) 6 (-j',0), then by (3.10) and the properties of the Mellin transform we mentioned, we see that
We note that the Mellin transform is a special case of the Fourier transform. So there is an inverse Mellin in transform. Since Wj is continuous on (0, oo), then where the sum is taken over all poles a\ of the region bounded by Ti,T2,r3,74. By the smallness property of the Mellin transform (see page 402 of [17]), we observe that if c € (—a, — /?), where (—a, strip of Wj. Thus
is the fundamental We also observe (see page 408 of [17]) that
since c = —1/2 is in the fundamental strip of Vj € N. Similarly
since c = —1/2 is in the fundamental strip o
157
Combining these results proves the claim made in (3.11). The same reasoning shows that Now we compute Res that
We first observe
We make the observation that itfouowsthat Using this observation, we claim that if j € N then
(3.13) Combining these results, the claim given in (3.14) now where h = —plnp — q Inq denotes entropy and where follows from (3.12). As an immediate corollary of (3.14), we see that
To prove the claim, wefirstobserve that, if k € Z, then
We note that, if In p/In q is irrational and u is fixed, then
and
Thus
and
do not exhibit fluctuation when In P/In q is irrational.
3,4 Depoissonization. Recall that, in the original problem statement, n is a large, fixed integer. Most Of our analysis has utilized a model where n is a Poisson random variable. Therefore, to obtain results about Now the claim made m (3.13) follows immediately fro the problem we orginally stated, it is necessary to depoissonize our results. We utilize the depoissonization techniques discussed in [10] and Chapter 10 of [17], espedaily the Depoissonization Lemma, to prove Theorems G land 2. For the reader convenience we recall here some depoissonization results of [10]. Recall that a measurwhere h = -plnp - qlnq denotes entropy and where aDle function t/;:(0,oo) -+ (0,oo) is slowly varying if ilj(tx)/^(x) -¥ 1 as x -+ oo for every fixed t > 0. THEOREM 3.1. Assume that a Poisson transform of a sequence gn which is an entire function of a complex variable z. Suppose that there exist rea/ constante a < 1, /3, 9 <E (0,7r/2), d, c2, The proof is similar to the proof of (3.13). If k ^ 0 then and ZQ, and a slowly varying function $ such that the following conditions hold, where Se is the cone So =
158
(I) For all z ∈ S_θ with |z| > z_0,

Then for n ≥ 1,

More precisely,

The "Big-Oh" terms in (3.17) and (3.18) are uniform for any family of entire functions G that satisfy the conditions with the same α, β, θ, c_1, c_2, z_0 and ψ. Also note that

Now, we are in a position to depoissonize our results. By (3.13), it follows that

when u is fixed, since |δ| is uniformly bounded on C. We define ψ(z) = 1 for all z and note that ψ is a slowly varying function (i.e., ψ: (0, ∞) → (0, ∞) and ψ(tx)/ψ(x) → 1 as x → ∞ for every fixed t > 0). Also there exist real-valued constants c_M, c_{j,M}, z_M, z_{j,M} such that

So condition (I) of Theorem 3.1 is satisfied. It follows immediately from Theorem 10.4 of [17] (see page 456) that condition (O) is also satisfied. So by Theorem 3.1 it follows that Theorems 1 and 2 hold, as claimed.

To see that (2.5) follows from (2.4), consider the following. From (2.4), we have E[u^{M_n}] =

Observe that

since |δ_j| is uniformly bounded on C. By (3.14), we see that

Then we apply these observations to (3.19) to conclude that (2.5) holds. Finally, we note that (2.6) is an immediate corollary of (2.5).
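As a compact reminder of the scheme just invoked, here is a first-order sketch under the standard Jacquet-Szpankowski assumptions; the exact conditions and error terms are those of Theorem 3.1, not of this display.

```latex
% Analytic depoissonization, first-order form (sketch).  Here
% \widetilde{G}(z) = e^{-z}\sum_{n\ge 0} g_n z^n/n!  is the Poisson transform.
\[
  \widetilde{G}(z) \;=\; e^{-z}\sum_{n\ge 0} g_n \frac{z^n}{n!},
  \qquad\text{and, under conditions (I) and (O),}\qquad
  g_n \;=\; \widetilde{G}(n) + O\!\bigl(n^{\beta-1}\psi(n)\bigr).
\]
% In words: the fixed-n quantity g_n is recovered, up to a small error,
% by evaluating its Poisson-model counterpart at z = n.
```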
References
[1] Lars V. Ahlfors, Complex Analysis, 3rd ed. (New York: McGraw-Hill, 1979).
[2] William Feller, An Introduction to Probability Theory and Its Applications, Volume I, 3rd ed., Volume II, 2nd ed. (New York: Wiley, 1968, 1971).
[3] James Allen Fill, Hosam M. Mahmoud, and Wojciech Szpankowski, "On the distribution for the duration of a randomized leader election algorithm," The Annals of Applied Probability 6 (1996), 1260-1283.
[4] Philippe Flajolet and Robert Sedgewick, Analytic Combinatorics (forthcoming). [A preview is available at .]
[5] Philippe Flajolet, Xavier Gourdon, and Philippe Dumas, "Mellin transforms and asymptotics: Harmonic sums," Theoretical Computer Science 144 (1995), 3-58.
[6] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik, Concrete Mathematics, 2nd ed. (Reading, Massachusetts: Addison-Wesley, 1994).
[7] Daniel H. Greene and Donald E. Knuth, Mathematics for the Analysis of Algorithms, 3rd ed. (Boston: Birkhauser, 1990).
[8] Peter Henrici, Applied and Computational Complex Analysis, Volumes 1-3 (New York: Wiley, 1974, 1977, 1986).
[9] Philippe Jacquet and Wojciech Szpankowski, "Autocorrelation on words and its applications: Analysis of suffix trees by string-ruler approach," Journal of Combinatorial Theory A66 (1994), 237-269.
[10] Philippe Jacquet and Wojciech Szpankowski, "Analytical depoissonization and its applications," Theoretical Computer Science 201 (1998), 1-62.
[11] Carl Svante Janson and Wojciech Szpankowski, "Analysis of an asymmetric leader election algorithm," The Electronic Journal of Combinatorics 4(1), R17 (1997), 1-16.
[12] Donald E. Knuth, Fundamental Algorithms, 3rd ed. (Reading, Massachusetts: Addison-Wesley, 1997).
[13] Stefano Lonardi and Wojciech Szpankowski, "Joint source-channel LZ'77 coding," Data Compression Conference (2003), 273-282.
[14] Helmut Prodinger, "How to select a loser," Discrete Mathematics 120 (1993), 149-159.
[15] Robert Sedgewick and Philippe Flajolet, An Introduction to the Analysis of Algorithms (Reading, Massachusetts: Addison-Wesley, 1996).
[16] Wojciech Szpankowski, "A generalized suffix tree and its (un)expected asymptotic behaviors," SIAM Journal on Computing 22 (1993), 1176-1198.
[17] Wojciech Szpankowski, Average Case Analysis of Algorithms on Sequences (New York: Wiley, 2001).
[18] Jacob Ziv and Abraham Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory IT-23 (1977), 337-343.
The Complexity of Jensen's Algorithm for Counting Polyominoes*

Gill Barequet†
Micha Moffie
Abstract
Recently I. Jensen published a novel transfer-matrix algorithm for computing the number of polyominoes in a rectangular lattice. However, his estimation of the computational complexity of the algorithm (O((√2)^n), where n is the size of the polyominoes) was based only on empirical evidence. This paper is based primarily on an analysis of the number of strings in a certain class that plays a significant role in the algorithm. It turns out that this number is closely related to Motzkin's numbers. We provide a rigorous computation that roughly confirms Jensen's estimation. We obtain the bound O(n^{5/2}(√3)^n) on the running time of the algorithm, while the actual number of polyominoes is about C·4.06^n/n, for some constant C > 0.
Keywords: Lattice animals, polyominoes, computational complexity.

1 Introduction
A polyomino of size n is an edge-connected set of n squares on a regular square lattice. Fixed polyominoes are considered distinct if they have different shapes or orientations. The symbol A(n) in the literature usually denotes the number of fixed polyominoes of size n. Figure 1(a) shows the only two fixed dominoes (adjacent pairs of squares). Similarly, Figures 1(b) and 1(c) show the six (resp., 19) fixed trominoes (resp., tetrominoes), that is, polyominoes of size 3 and 4, respectively. Thus, A(2) = 2, A(3) = 6, A(4) = 19, and so on.

Polyominoes have triggered the imagination of many scientists, not only mathematicians. The issue of the number of fixed polyominoes arises, for instance, when investigating the properties of liquid flow through grained material [BH57], such as water flowing through coffee grains. Statistical physicists refer to polyominoes as lattice animals, whose number is relevant when computing the mean cluster density in percolation processes.

To this day there is no known analytic formula for A(n). The only known methods for computing A(n) are based on explicitly or implicitly enumerating all the polyominoes. The following is a brief overview of the developments of counting fixed polyominoes:

• In 1962 Read [Re62] derived generating functions for calculating the number of fixed polyominoes. These functions become intractable very fast and were used for computing A(n) for only n = 1, ..., 10 (with an error in A(10)).

• In 1967 Parkin et al. [PLP67] computed the number of polyominoes of up to size 15 (with a slight error in A(15)) on a CDC 6600 computer.

* Work on this paper by the first author has been supported in part by the Fund for the Promotion of Research at the Technion.
† Center for Graphics and Geometric Computing, Dept. of Computer Science, The Technion - Israel Institute of Technology, Haifa 32000, Israel. E-mail: [barequet|smmicha]@cs.technion.ac.il
• In 1971 Lunnon [Lu71] computed the values of A(n) up to n = 18 (with a slight error in A(17)). His program generated polyominoes that could fit into a restricted rectangle. Since his algorithm generated the same polyominoes more than once, the program spent a considerable amount of time checking polyominoes for repetitions. It was thus an O(nA²(n))-time algorithm, and the program ran for about 175 hours (a little more than a week) on a Chilton Atlas I.

• An algorithm of Martin [Ma74] and Redner [Re82] that computes polyominoes of a given size and perimeter was used in 1976 by Sykes and Glen [SG76] to enumerate A(n) up to n = 19.

• In 1981 Redelmeier [Re81] introduced a novel enumeration algorithm. His new method, which was actually a procedure for subgraph counting, did not reproduce any of the previously-generated polyominoes. Thus, he did not need to keep in memory all the already-generated polyominoes, nor did he need to check if a newly-generated polyomino was already counted. He implemented his method efficiently, and computed A(n) up to n = 24. It was the first O(A(n))-time algorithm, and Redelmeier's program required about 10 months of CPU time on a PDP-11/70. Mertens and Lautenbacher [ML92] later devised a parallel version of Redelmeier's algorithm and used it for computing the number of polyominoes (of up to some size) on a triangular lattice.
Figure 1: Fixed dominoes, trominoes, and tetrominoes

• In 1995 Conway [Co95] introduced a transfer-matrix algorithm, subsequently used by Conway and Guttmann [CG95] for computing A(25).

• Oliveira e Silva [Ol01] used (in an unpublished work) a parallel version of Redelmeier's algorithm to count free polyominoes (for which the orientation does not matter) of up to size 28.

• Jensen [Je01] significantly improved the algorithm of Conway and Guttmann, and computed A(n) up to n = 46. Following quickly, Knuth [Kn01] applied (in an unpublished work) a few local optimizations to Jensen's algorithm and was able to compute A(47). In a further computation Jensen [Je01a] claimed to also obtain A(48). In an unpublished work Jensen also claimed to parallelize the algorithm and compute A(n) for n ≤ 56.

It is known that A(n) is exponential in n. Klarner [Kl67] showed that A(n) ~ Cλ^n n^θ (for some constants C > 0 and θ ≈ -1), so that the limit λ = lim_{n→∞}(A(n + 1)/A(n)) exists. Golomb [Go65] gave λ its well-known name, Klarner's constant, of which not even a single significant digit is known for sure. There have been several attempts to lower- and upper-bound λ, as well as to estimate it, based on knowing A(n) up to certain values of n. The constant λ is estimated to be around 4.06 [CG95].

In this paper we analyze the running time of Jensen's algorithm [Je01]. It has two factors: a polynomial in n, and an exponential term J(W) (where W = n/2). Jensen did not provide an analysis of the running time of his algorithm. Instead, he showed a graph [ibid., §2.1.3] that plots J(W) (the number of signatures of length W, a key ingredient of the algorithm; see Section 3) as a function of W, for 1 ≤ W ≤ 23. The graph was drawn in a half-logarithmic scale, and the plotted points visually seem to be more-or-less located around some line. From the slope of this imaginary line, Jensen estimated that J(W) was proportional to 2^W. Therefore, Jensen concluded that the running time of his algorithm was proportional to (√2)^n times some polynomial factor. Jensen's algorithm prunes the signatures, discarding intermediate configurations that cannot be completed into valid polyominoes. In this paper we analyze S(W), the number of unpruned signatures. We show that it equals M(W + 1) - 1, where M(W) is the Wth Motzkin's number [Mo48]. Since the Wth Motzkin's number is proportional to 3^W (neglecting some polynomial factor), we obtain an upper bound on the running time of Jensen's algorithm, which is proportional to (√3)^n times some polynomial factor.

This paper is organized as follows. In Section 2 we give a brief description of Jensen's transfer-matrix algorithm for counting fixed polyominoes. In Section 3 we analyze the number of some class of strings that
plays a significant role in the algorithm. In Section 4 we provide a detailed analysis of the running time of the algorithm.

2 Jensen's Transfer-Matrix Algorithm
In this section we briefly describe Jensen's algorithm for counting fixed polyominoes. The reader is referred to the original paper [Je01] for the full details.

The algorithm separately counts all the polyominoes of size n bounded by rectangles whose dimensions are W (its width, y-span) and L (its length, x-span). In each iteration of the algorithm different values of W and L are considered, for all possible values of W and L. (Due to symmetry, only values of W ≤ L need to be considered.)

For specific values of W and L, the strategy is as follows. The polyominoes are built from left to right, and in each column from top to bottom. Instead of keeping track of all polyominoes, the procedure keeps records of the numbers of polyominoes with identical right boundaries. Towards this aim the right boundaries of the (yet incomplete) polyominoes are encoded by signatures, as will be described shortly. A polyomino is expanded in the current column, cell by cell, from top to bottom. The new cell is either occupied (i.e., belongs to the new polyomino) or empty (i.e., does not belong to it). Thus, the right boundaries of the (yet incomplete) polyominoes have a "kink" at the currently considered cell. By "expanding" we mean updating both the signatures (possibly creating new signatures) and their respective numbers of polyominoes. For implementation purposes the numbers are maintained as polynomials in the form of generating functions: the terms of the polynomial P(t) = Σ c_i t^i mean that c_i distinct (possibly incomplete) polyominoes of size i correspond to some signature.

The right boundaries of (yet incomplete) polyominoes are encoded by signatures that contain five symbols: the digits 0-4. The symbol '0' stands for an empty cell. The symbol '1' stands for an occupied cell that is not connected by other cells to any other boundary cell. The other three symbols represent cells which are connected to other boundary cells. In case there are several boundary cells that are connected either along the boundary or by cells to the left of it, the lowest cell is encoded by the symbol '2', the highest cell by the symbol '4', and all the other cells (if any) in that group by the symbol '3'. Figure 2 (similar to Figure 1 of [Je01]) shows the signature of some (yet incomplete) polyomino, before and after cell expansion. Note that this polyomino is indeed incomplete because it has three disconnected components. Yet the algorithm needs to consider such intermediate configurations since the yet-to-be-added cells on the right may connect disconnected components and yield a legal polyomino. Only at the termination of the iteration does the algorithm need to check whether a signature corresponds to legal polyominoes. Otherwise the corresponding number of polyominoes (of that signature) is ignored and not counted. Also note in the figure that the lowest '2' and the highest '4' in the signature encode boundary cells which are connected by cells to the left of the boundary.

There are several implementation details that are omitted here. One such detail is the initialization of an iteration, that is, building the set of signatures that correspond to (yet incomplete) polyominoes spanning only one column. Another detail is the exhaustive set of polyomino-expansion rules. In an expansion step the boundary kink is lowered by one by considering the kink cell and either making it occupied or leaving it empty. The symbols encoding the cells adjacent to the kink, together with the fact of whether or not the new cell is occupied, determine how to update the signature. Most combinations require local updates, whereas a few combinations require scanning the entire signature for updating. In addition, the signature needs to be expanded by two bits that indicate whether or not the (yet incomplete) polyominoes touch the top and bottom boundaries of the bounding rectangle.

Some important optimizations are also performed. Since only polyominoes of size n are sought, the signatures of intermediate polyominoes exceeding this size may be discarded. In addition, local rules can be applied to prune out intermediate configurations that cannot be completed into valid polyominoes. For example, it would save time to discard signatures of intermediate polyominoes that have more than one connected component, and whose union into one component will require a total of more than n cells. The same applies for signatures of intermediate polyominoes whose yet-to-be-constructed connections to all of the top, bottom, and right borders will require more than n cells in total.

At the termination of each iteration, the procedure discards all signatures that correspond to illegal polyominoes, such as disconnected polyominoes (as mentioned above), or polyominoes that do not touch all the boundaries of the bounding rectangle. Then all the numbers of polyominoes associated with legal polyominoes are summed up to yield the number of legal polyominoes bounded by some W × L rectangle. By iterating over all possible values of W and L and summing up, the algorithm computes the total number of polyominoes of size n. (Naturally, for one specific value of W, the iterations of the inner loop on L may transfer information between them to save execution time.)
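To make the bookkeeping concrete, here is a small Python sketch (ours, not Jensen's code) of the data structure described above: a dictionary mapping each signature string to its counting polynomial, with the two polynomial operations the algorithm needs, namely adding the polynomials of two merged states and shifting by t when an occupied cell is added. The signatures used below are only illustrative placeholders, not actual expansion rules.

```python
# Toy sketch of the signature -> polynomial bookkeeping (not the full algorithm).
# A polynomial sum_i c_i t^i is stored as a list of coefficients [c_0, c_1, ...],
# where c_i counts partial polyominoes of size i sharing the same signature.

def poly_add(p, q):
    """Add two counting polynomials."""
    out = [0] * max(len(p), len(q))
    for i, c in enumerate(p):
        out[i] += c
    for i, c in enumerate(q):
        out[i] += c
    return out

def poly_shift(p, n_max):
    """Multiply by t (one more occupied cell), discarding sizes beyond n_max."""
    return ([0] + p)[: n_max + 1]

def record(states, signature, poly):
    """Insert a (signature, polynomial) pair, merging with an existing entry."""
    if signature in states:
        states[signature] = poly_add(states[signature], poly)
    else:
        states[signature] = poly

# Hypothetical usage: two expansion steps that happen to produce the same signature.
states = {}
record(states, "01020", poly_shift([0, 0, 3], n_max=10))   # 3 partial animals of size 2, cell added
record(states, "01020", [0, 0, 0, 1])                      # 1 partial animal of size 3
print(states["01020"])                                     # -> [0, 0, 0, 4]
```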
Figure 2: A sample intermediate polyomino, its right boundary, and its signature (example taken from [Je01])

As mentioned in the introduction, Jensen did not provide a rigorous analysis of the running time of his algorithm. Instead, he empirically estimated that it is proportional to (√2)^n, where n is the size of the sought-after polyominoes. In the following two sections we provide a more rigorous estimation of the running time.

3 Boundary Strings
We begin by investigating the number of kinkless boundary strings of length W. (In this section the boundaries are assumed to be horizontal for ease of exposition.) Consider a square lattice of width W and unrestricted height h, in which some of the W·h cells are occupied and the other cells are empty. Identify the connected components of the occupied cells, where
connectivity is through edges only. Assign a symbol to each connected component and a special symbol '-' to empty cells, and regard the lowest row of the lattice as the "signature" string representing the configuration of occupied cells. The completely empty string (containing only '-'s) is not allowed. Obviously, a signature may represent many different configurations. Moreover, the number of distinct symbols needed to specify all signatures is ⌈W/2⌉ + 1, since there may be at most ⌈W/2⌉ connected components. Figure 6(a) shows a lattice configuration and its representing signature string. The question is, then, what is S(W), the number of different signature strings (as a function of W, up to renaming of the non-'-' symbols).
We now prove the relation between the number of signature strings and Motzkin's numbers.

THEOREM 3.1. S(W) = M(W + 1) - 1.

Proof. Let S*(W) = S(W) + 1, that is, the number of all legal signatures plus the unique empty signature (-,-,...,-) that is illegal in the context of polyominoes. We will now evaluate S*(W). The number of signature strings whose last character is '-' is simply S*(W - 1). Otherwise (when the last character is not '-'), the corresponding last cell of the signature may be connected (through the boundary and/or the area to the left of the boundary, which in this section is the area on top of the signature) to other signature cells. Let i be the lowest index (counting from left to right) of the signature cell connected to the Wth cell (obviously 1 ≤ i ≤ W). For i ≥ 2, the (i - 1)st cell must be empty, and the number of distinct signatures is the product of the number of subsignatures of the leftmost i - 2 cells and the number of subsignatures of the rightmost W - i + 1 cells (of which the leftmost and rightmost are occupied by definition, so there is freedom only in the middle W - i - 1 cells). Figure 3 shows an example in which W = 17 and i = 7. (Recall that cell connectivity is through edges only!) Thus

with the convention S*(-1) = S*(0) = 1. It is easily seen that this is exactly the recurrence of the Motzkin series (with a shift of one index):

for k ≥ 2 and M(0) = M(1) = 1. It follows immediately that S(W) = S*(W) - 1 = M(W + 1) - 1. Indeed, {S(W)}_{W=1,2,...} = {1, 3, 8, 20, 50, 126, ...}, while the first few Motzkin's numbers are {M(k)}_{k=0,1,...} = {1, 1, 2, 4, 9, 21, 51, 127, ...}.

Figures 4 and 5 show all the different possible boundaries and their signatures for 1 ≤ W ≤ 5. It is easy to see that M(k) ≤ 3^k, since M(k) is also the number of paths from (0, 0) to (k, 0) in a k × k grid that use only the steps (1,1), (1,0), and (1,-1), and do not go under y = 0. In fact, it is well known that M(k) = Θ(3^k/k^{3/2}).

An alternative proof for Theorem 3.1 is to show a bijection between the signature strings and Motzkin paths. Consider the edges between boundary cells. Edges between occupied cells of the same connected component (or between empty cells) are mapped to the step (1,0). For a block of consecutive occupied cells of the same connected component, distinguish between four cases:

1. If there exists only one block (of the same component), then its left (resp., right) bounding edge is mapped to the step (1,1) (resp., (1,-1));

Otherwise (in case there are at least two blocks):

2. For the first block both bounding edges are mapped to the step (1,1);

3. For an intermediate block (if any), the left (resp., right) bounding edge is mapped to the step (1,-1) (resp., (1,1));

4. For the last block both bounding edges are mapped to the step (1,-1).

Figure 6 shows an example of this bijection. It is rather easy to prove that this is a bijection. By definition, different strings correspond to different paths, and vice versa. We only need to verify that every signature string is mapped to a valid path, and that every path can be realized by some configuration and its respective string. To this aim we note that the order of the starts and ends of the blocks (of the same connected component) along the signature must be a legal parentheses sequence. Otherwise the borders of the connected components would intersect. On one hand, this guarantees, following directly from the definition of the bijection, that a signature is always mapped to a path that does not go below the x-axis. On the other hand, it ensures that every path is realizable. Every path is uniquely mapped into a string (by applying the bijection rules backward), and since it is a legal parentheses sequence, a cell configuration represented by this signature can easily be constructed, e.g., by building rectilinear "bridges" connecting blocks of the same symbol, supported by "legs" standing on top of the left cell of each block, and high enough to allow lower bridges found between legs supporting the same bridge. Figure 7 shows the realization of the string shown in Figure 6.

Finally, we explain the index shift between the number of signature strings and Motzkin's numbers, that is, why S(W) = M(W+1) - 1. This is simply because a signature of length k is mapped to a path of length k + 1, while the (k + 1)st Motzkin's number is the number of paths of length k + 1. The missing string is the empty string (-,-,...,-), which corresponds to the straight x-parallel path. The reader can also observe the resemblance between signature strings and drawing chords in an outerplanar graph, which is yet another way to describe Motzkin's numbers.
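A quick numerical check of Theorem 3.1 (our own sketch, using the standard Motzkin convolution recurrence) reproduces the values listed in the proof:

```python
# Motzkin numbers via the standard convolution recurrence
#   M(0) = M(1) = 1,  M(k) = M(k-1) + sum_{i=0}^{k-2} M(i) * M(k-2-i),
# together with the identity S(W) = M(W+1) - 1 from Theorem 3.1.

def motzkin(n_max):
    m = [1, 1]
    for k in range(2, n_max + 1):
        m.append(m[k - 1] + sum(m[i] * m[k - 2 - i] for i in range(k - 1)))
    return m

M = motzkin(8)
print(M)                                    # [1, 1, 2, 4, 9, 21, 51, 127, 323]
print([M[W + 1] - 1 for W in range(1, 7)])  # [1, 3, 8, 20, 50, 126]  == S(1..6)
```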
Figure 3: The leftmost signature cell connected to the Wth signature cell (when the latter is occupied)
Figure 4: All possible signatures for 1 ≤ W ≤ 4

4 Computational Complexity
In this section we closely follow the notation of [Je01]. Denote by n the size of the polyominoes whose number we intend to compute. Overall, we have p iterations of Jensen's algorithm that differ in the dimensions of the bounding rectangle of the polyominoes counted in each iteration. Denote by W (resp., L) the width (resp., length), that is, the y-span (resp., x-span) of a bounding rectangle in one iteration of Jensen's algorithm. In each such iteration W is at most W_max (the maximum width needed to be considered), and L = 2W_max + 1 - W;
Figure 5: (cont. of Figure 4) All possible signatures for W = 5

thus, L_max (the maximum length) is at most 2W_max. In each iteration of the algorithm we expand the so-far traversed lattice by W · L cells. Each such expansion consists of determining whether a new cell is either occupied or empty. This involves the consideration of all possible signatures σ. Denote the total number of possible signatures by N_conf. Each individual cell-expansion step requires the following operations:

• Fetching a signature from the database of signatures;

• Computing a new signature that is a function of the old signature, and of whether the new cell is occupied or not;

• Updating the polynomial that counts the number of possible distinct configurations for the specific signature; and

• Searching for the new signature in the database. If it did not previously exist, insert it and its polynomial. Otherwise, replace the existing polynomial by the sum of the existing and new polynomial.

We denote by h the time needed for a random access into the signatures database, by s the time for computing a new signature, and by u the time needed for updating a polynomial. Clearly, the over-all running time of the algorithm is

Obviously, W_max = ⌈n/2⌉ and L_max = n. Therefore, W_max, L_max = O(n). In the original description of the algorithm, p = W_max · L_max = O(n²). However, instead of two nested loops for W and L, one can have only one loop on W with only L = L_max. After processing each column, one can check all the configurations. The polynomials representing the respective numbers of
Figure 6: Mapping a signature to a path
Figure 7: String realization

polyominoes of all the valid configurations (those with one connected component spanning the entire W × L rectangle) will update the number of polyominoes of the appropriate sizes. Thus we can improve p by a factor of n and have p = O(n).

If we implement the database of signatures as a perfect hashing table (with the signature as a key), we can expect each access to the database to depend (for the average case) only linearly on the size of the signature, independently of the number of signatures stored in the database. That is, h = O(W_max) = O(n).

It follows directly from the description of the algorithm that s = O(W_max) = O(n). Indeed, most of the signature updates are local, and are reflected by O(1) operations in the vicinity of the expanded cell. Only a few update rules require the traversal of the entire signature (whose complexity is O(n)) and a few global updates of it. A typical such update is required when the expanded cell touches the top boundary cell of some connected component, and we either need to find the second-to-top, or even the bottom boundary cell of the
same component. For this purpose we traverse the signature top to bottom and count top and bottom cells of connected components until they balance appropriately, providing enough information for identifying the sought-after boundary cell. The running time of this traversal is linear in n.¹

The polynomial operations required by Jensen's algorithm are the addition of two polynomials P_1(t), P_2(t) of maximum degree n and the multiplication of a polynomial P(t) of maximum degree n by t. (The latter operation implements the addition of one occupied cell.) Clearly each such operation requires O(n) time, and since each cell-expansion operation requires a constant number of such polynomial operations, we have u = O(n).

¹ We believe that with a little effort, and by applying only a constant multiplier on the amount of memory needed to store a signature, we can show that s = o(n) by using an appropriate disjoint-sets data structure. However, this improvement is asymptotically negligible since s is dominated by h and u.
Finally, we provide an upper bound on N_conf, the number of signatures of length W. We ignore the two bits of the signature that indicate whether or not the respective set of polyominoes touches the top and bottom boundaries of the bounding box, because these bits add only a constant factor of 4 on the number of different signatures. In Section 3 we showed that the number of signatures without a kink is exactly M(W + 1) - 1, where M(k) = Θ(3^k/k^{3/2}) is the kth Motzkin's number. Now consider boundaries with a kink. At most, this applies a constant factor on the number of signatures.² To see this, regard the length-W signatures as a subset of all the signatures of length W + 1 obtained by adding the kink cell to the boundary as an empty cell, and then dropping from the signature the symbol '0' from the position of the kink. By dropping these kink symbols different signatures may be identified, but this only helps. Thus we conclude that

Summing everything up, we obtain O(n⁴M(n/2)) = O(n^{5/2}(√3)^n) as an asymptotic bound on the running time of Jensen's algorithm. As mentioned earlier, Jensen's algorithm also prunes out signatures of boundaries that can never close into a legal polyomino because, for example, the number of cells remaining to be occupied is insufficient for the polyomino to touch both upper and lower boundaries of the bounding box, or because there are not enough cells to connect all occupied cells into one connected component. Obviously, most of the signatures are pruned out when W and L are close to n/2. This is probably the reason for the difference between the (√3)^n term in our bound and the (√2)^n-like behavior observed empirically by Jensen.

² We believe that it is easy to obtain a sharper upper bound on the number of signatures with kinks, but since M(k) = Θ(M(k + 1)) (as lim_{k→∞} M(k + 1)/M(k) = 3) we have N_conf(W) = Θ(M(W + 2)) = Θ(M(W)), and the exact constant of proportionality is of no interest.

Acknowledgment
We are grateful to Günter Rote and Stefan Felsner for suggesting the use of Motzkin's numbers, and to Noga Alon and Dan Romik for providing important insights on these numbers.

References
[BH57] S.R. Broadbent and J.M. Hammersley, Percolation processes: I. Crystals and mazes, Proc. Cambridge Philosophical Society, 53 (1957), 629-641.
[Co95] A.R. Conway, Enumerating 2D percolation series by the finite-lattice method: Theory, J. Physics, A: Mathematical and General, 28 (1995), 335-349.
[CG95] A.R. Conway and A.J. Guttmann, On two-dimensional percolation, J. Physics, A: Mathematical and General, 28 (1995), 891-904.
[Go65] S.W. Golomb, Polyominoes, 2nd ed., Princeton Univ. Press, 1994.
[Je01] I. Jensen, Enumerations of lattice animals and trees, J. of Statistical Physics, 102 (2001), 865-881.
[Je01a] I. Jensen, http://sunburn.stanford.edu/~knuth/programs/Jensen.tit (a personal communication with D.E. Knuth).
[Kl67] D.A. Klarner, Cell growth problems, Canadian J. of Mathematics, 19 (1967), 851-863.
[Kn01] D.E. Knuth, http://sunburn.stanford.edu/~knuth/programs.html#polyominoes (a personal WWW page).
[Lu71] W.F. Lunnon, Counting polyominoes, in: Computers in Number Theory (A.O.L. Atkin and B.J. Birch, eds.), Academic Press, London, 1971, 347-372.
[Ma74] J.L. Martin, Computer techniques for evaluating lattice constants, in: Phase Transitions and Critical Phenomena, vol. 3 (C. Domb and M.S. Green, eds.), Academic Press, London, 97-112, 1974.
[ML92] S. Mertens and M.E. Lautenbacher, Counting lattice animals: A parallel attack, J. of Statistical Physics, 66 (1992), 669-678.
[Mo48] T. Motzkin, Relations between hypersurface cross ratios, and a combinatorial formula for partitions of a polygon, for permanent preponderance, and for non-associative products, Bull. American Mathematical Society, 54 (1948), 352-360.
[Ol01] T. Oliveira e Silva, http://www.ieeta.pt/~tos/animals/a44.html (a personal WWW page).
[PLP67] T.R. Parkin, L.J. Lander, and D.R. Parkin, Polyomino enumeration results, SIAM Fall Meeting, Santa Barbara, CA, 1967.
[Re62] R.C. Read, Contributions to the cell growth problem, Canadian J. of Mathematics, 14 (1962), 1-20.
[Re81] D.H. Redelmeier, Counting polyominoes: Yet another attack, Discrete Mathematics, 36 (1981), 191-203.
[Re82] S. Redner, A Fortran program for cluster enumeration, J. of Statistical Physics, 29 (1982), 309-315.
[SG76] M.F. Sykes and M. Glen, Percolation processes in two dimensions: I. Low-density series expansions, J. Physics, A: Mathematical and General, 9 (1976), 87-95.
Distributional Analyses of Euclidean Algorithms

Viviane BALADI*
Brigitte VALLÉE†
December 1st, 2003

Abstract
We provide a complete analysis of the standard Euclidean algorithm and two of its "fast" variants, the nearest-integer and the odd-quotient algorithms. For a whole family of costs, including the number of iterations, we show that the distribution of the cost is asymptotically normal, and obtain the optimal speed of convergence. Precisely, we establish both local and central limit theorems, which characterize respectively distribution functions and their cumulative versions. Our results widely extend earlier results of Hensley (1994) regarding the number of steps of the standard algorithm and, even in this particular case, provide improved error estimates. We view an algorithm as a dynamical system restricted to rational inputs, and combine tools imported from dynamics, such as transfer operators, with various other techniques: Dirichlet series, Perron's formula, quasi-powers theorems, and the saddle-point method. Such dynamical analyses had previously been used to perform the average-case analysis of algorithms. The present (dynamical) analysis in distribution relies on a novel approach based on bivariate transfer operators and builds upon recent results of Dolgopyat (1998) by providing pole-free regions for certain associated Dirichlet series.
1 Introduction
According to Knuth [25, p. 335], "we might call Euclid's method the granddaddy of all algorithms, because it is the oldest nontrivial algorithm that has survived to the present day". Since addition and multiplication of rational numbers both require gcd calculations, Euclid's algorithm is a basic building block of computer algebra systems and multi-precision arithmetic libraries, and, in many such applications, most of the time is indeed spent in computing gcd's. The classic books by Schönhage, H. Cohen, Knuth, and von zur Gathen and Gerhard further illustrate the central role of this algorithm in computational number theory and related applications.

* CNRS UMR 7586, Institut de Mathématiques de Jussieu, F-75251 Paris, France.
† CNRS UMR 6072, GREYC, Université de Caen, F-14032 Caen, France.
Although its design is quite clear, the Euclidean algorithms have not yet been completely analysed. We consider here three (discrete) Euclidean algorithms (the standard Euclidean algorithm, the centered and the odd-quotient algorithms) and their (continuous) versions, the continued fraction algorithms. We allow in general a unit cost c incurred at each step that is of moderate growth (i.e., at most logarithmic) and precisely describe the asymptotic distribution of the total cost C(x) of an execution of the Euclidean Algorithm on the input x. As a technical preparation, we first do so when x is real (and trajectories are truncated, see Theorem 1), then consider the case where x is randomly (uniformly) drawn from rational inputs with denominator and numerator less than N. Our first main result, Theorem 3, is a central limit theorem (CLT): it states that the distribution of C is asymptotically Gaussian, with expectation and variance asymptotically proportional to log N. Theorem 3 gives the best possible speed of convergence to the Gaussian law, of order O(1/√(log N)). Furthermore, the constants μ(c) and δ²(c) that intervene in the main terms of the expectation and the variance appear to be computable constants, with μ(c) even admitting a closed form. Our second main result, Theorem 4, is a local limit theorem (LLT) which holds for any lattice cost c (i.e., a nonzero cost whose values belong to a set of the form LN, with L > 0). Note that any integer cost is lattice. Three special instances of our results are of major interest.

(i) Digit-cost c(m) = 1. This plainly yields as total cost the number of steps of our three Euclidean algorithms, for which normality results. In this special case, the LLT was previously proved by Hensley [20], in the particular case of the Standard Euclidean Algorithm, but with a worse speed of convergence.

(ii) Digit-cost c(m) = c_k(m), the characteristic function of a digit k. We then characterize the frequencies of each digit as Gaussian, a result which is itself new.
(iii) Digit-cost c(m) = ⌊log₂ m⌋. This gives the length of the binary encoding of a continued fraction representation, also proved to obey Gaussian fluctuations. This analysis is of interest in relation to the information content of continued fraction representations and to the boolean complexity of the Euclidean algorithm, for which a complete analysis is still lacking.
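For concreteness, here is a short Python sketch (ours, for illustration only) that runs the standard Euclidean algorithm on a rational input and accumulates the three digit-costs just listed; the input 1597/610 is an arbitrary example.

```python
# Illustration of the three digit-costs (i)-(iii) for the standard Euclidean
# algorithm: c = 1 (number of steps), c = 1_{m = k} (occurrences of the digit
# k, here k = 1), and c = floor(log2 m) (binary length of the quotients).

def standard_digits(u, v):
    """Continued-fraction digits (quotients) produced by v = m*u + r, 0 <= r < u."""
    digits = []
    while u > 0:
        m, r = divmod(v, u)
        digits.append(m)
        v, u = u, r
    return digits

u, v = 610, 1597                                   # arbitrary example input
ds = standard_digits(u, v)
steps      = len(ds)                               # cost (i):   c(m) = 1
ones       = sum(1 for m in ds if m == 1)          # cost (ii):  c(m) = 1_{m = 1}
binary_len = sum(m.bit_length() - 1 for m in ds)   # cost (iii): floor(log2 m)
print(ds, steps, ones, binary_len)
```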
A full version of this extended abstract [4], titled "Euclidean Algorithms are Gaussian", is available from the ArXiv.

2 Methodological Outlook
2.1 Three Euclidean algorithms and their dynamical systems. Three Euclidean algorithms will be analyzed; each of them is related to a Euclidean division. These algorithms possess similar properties, with various parameters described in Figure 1. They form a class which will be denoted by E.

Let v > u ≥ 1 be integers. The classical division, corresponding to the standard Euclidean algorithm G, v = mu + r, produces an integer m ≥ 1 and an integer remainder r such that 0 ≤ r < u. The centered division (centered algorithm K) requires v > 2u and takes the form v = mu + s, with integer s ∈ [-u/2, +u/2[. Letting s = εr, with ε = ±1 (and ε = +1 if s = 0), it produces an integer remainder r such that 0 ≤ r ≤ u/2, and an integer m ≥ 2. The odd division (odd algorithm O) produces an odd quotient: it is of the form v = mu + s with m odd and integer s ∈ [-u, +u[. Letting s = εr, with ε = ±1 (and ε = +1 if s = 0), it produces an integer remainder r with 0 ≤ r < u, and an odd integer m ≥ 1. In the three cases, the divisions are defined by the pairs q = (m, ε), which are called the digits. (See Figure 1.)

For the present purposes, since the pair (du, dv) produces the same sequence of digits as (u, v), up to multiplying all remainders r by d, we may consider the rational u/v instead of the integer pair (u, v). Then, rationals u/v belong to the interval I, and the division expressing the pair (u, v) as a function of (r, u) is replaced by a linear fractional transformation (LFT) h that expresses the rational u/v as a function of r/u. To summarize, on the input (u, v), each algorithm produces a sequence of pairs ((m_1, ε_1), ..., (m_p, ε_p)) and thus builds a specific continued fraction of depth p, decomposing it also as

with h_i ∈ H, 1 ≤ i ≤ p - 1, and h_p ∈ F. The "generic" set H contains LFTs of the form h(x) = 1/(m + εx) with (m, ε) ∈ A. The LFTs appearing in the final step belong to a subset F ⊂ H of the form h(x) = 1/(m + εx) with (m, ε) ∈ A ∩ B. (See Figure 1.)

We associate to each algorithm a dynamical system of the interval, i.e., a transformation T : I → I. For nice and elementary surveys on interval dynamics, see [5, 26, 11]. The interval I has already been defined (see Figure 1) and T extends the map defined on rationals by the equality T(u/v) = r/u, where r is the remainder of the Euclidean division on (u, v). Then,

where the function A depends on the algorithm and is defined in Figure 1. Note also that H is just the set of inverse branches of our map T : I → I. Furthermore, each h ∈ H maps I into I, and the set of the inverse branches of the iterate T^k is exactly the set H^k; its elements are called the inverse branches of depth k. Each interval h(I) for h of depth k is called a fundamental interval of depth k, and, for any n ≥ 1, the union of all fundamental intervals of depth n is exactly I. Each dynamical system is strongly contracting, and the contraction ratio satisfies

(2.3) ρ := lim_{n→∞} (max{|h'(x)|; h ∈ H^n, x ∈ I})^{1/n} < 1.

2.2 Costs of truncated real trajectories. The trajectory T(x) = (x, T(x), T²(x), ..., T^n(x), ...) is entirely described by the infinite sequence of digits (m_1(x), m_2(x), m_3(x), ..., m_n(x), ...), through

When x is rational, the trajectory T(x) reaches 0 in finitely many steps, and the largest p for which m_p(x) is finite is the depth of the trajectory (or of the rational).
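The three division rules translate directly into code; the following sketch (ours, not from the paper) produces the digit sequence (m_i, ε_i) for each algorithm, matching the definitions above.

```python
# The three Euclidean divisions of Section 2.1, each returning (m, eps, r)
# with v = m*u + eps*r and the stated constraints on m and r.

def standard_step(u, v):          # algorithm G: 0 <= r < u, m >= 1
    m, r = divmod(v, u)
    return m, +1, r

def centered_step(u, v):          # algorithm K: 0 <= r <= u/2, m >= 2 (needs v > 2u)
    m = (2 * v + u) // (2 * u)    # nearest integer to v/u (halves rounded up)
    s = v - m * u                 # s lies in [-u/2, +u/2[
    eps = 1 if s >= 0 else -1
    return m, eps, abs(s)

def odd_step(u, v):               # algorithm O: m odd, s in [-u, +u[
    m = v // u
    if m % 2 == 0:                # force an odd quotient
        m += 1
    s = v - m * u
    eps = 1 if s >= 0 else -1
    return m, eps, abs(s)

def digits(u, v, step):
    out = []
    while u > 0:
        m, eps, r = step(u, v)
        out.append((m, eps))
        v, u = u, r
    return out

print(digits(610, 1597, standard_step))
print(digits(610, 1597, centered_step))
print(digits(610, 1597, odd_step))
```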
For x ∈ I, it is usual to consider the truncated trajectory T_n(x) := (x, Tx, ..., T^n x) together with its truncated encoding (m_1(x), m_2(x), ..., m_n(x)), which keeps track of the first n stages of the process, and then to let the truncation degree n tend to ∞. Associate a nonnegative real value c(m) to each digit m, and put c(∞) = 0. We may also view c as a function on H, and extend the function c additively to (truncated) trajectories by setting, for each n ≥ 1,
Here, we consider costs c which are of moderate growth, i.e., they are nonzero and satisfy c(m) = O(log m). Any triple (I, T, c) formed with a dynamical system of E and a cost c of moderate growth will be called of EMG-type. Some special cases are particularly interesting: if c = c_m is the characteristic function of a fixed digit m, then C^(n) is just the number of occurrences of m amongst the first n digits of the CF-expansion; if c is the binary length of the digit, then C^(n) is closely related to the length of a standard encoding of the trajectory T_n. We shall refer to c as a "digit-cost" and to C as the associated "total cost". The cost C is a sum (of c) over time-iterates, and it is of great importance to characterize its probabilistic behavior as the truncation degree tends to ∞. The density transformer, also known as the Perron-Frobenius operator,
was introduced early in the study of continued fractions (see, e.g., the works of Lévy, Khinchin, Kuzmin, Wirsing, Babenko, and Mayer). It describes the density of T(X) when X has density f. The density transformer encapsulates important information on the "dynamics" of the iterative process. Two of its properties, the existence of a unique dominant eigenvalue [Property UDE], relative to a density f_1, together with a spectral gap separating this eigenvalue from the remainder of the spectrum [Property SG] (see Figure 2), entail the applicability of simple and powerful spectral methods in the EMG-setting: as the truncation degree n tends to ∞, the expectation E[C^(n)] of the cost C^(n) is asymptotic to n·μ(c), where
is the average of the cost c with respect to the dominant eigenfunction f_1. In order to establish more precise results on the distribution of truncated trajectories, and to study the speed of convergence, it is standard to use the moment generating function. One then works with a weighted transfer operator, which is a perturbation of the density
UDE: Unique Dominant Eigenvalue at (1, 0)
SG: Spectral Gap at (1, 0)
An_w(1, 0): Analyticity of the operator with respect to w at 0 for s = 1
An_s(1, 0): Analyticity of the operator with respect to s at 1 for w = 0
An_{s,w}(1, 0): Analyticity of the operator with respect to (s, w) in a neighborhood of (1, 0)
SMVL: Strict Maximum on the Vertical Line, attained at the real point s (for real parameters near (1, 0))
UEVL: Uniform Estimates near the Vertical Bi-Line {(s, w); ℜ(s) = 1, ℜ(w) = 0}

Figure 2: Main analytical properties of the operator H_{s,w} and its spectral radius R(s, w) (on a convenient functional space).
                              | Average-case                  | Distribution
Truncated real trajectories   | UDE + SG                      | UDE + SG + An_w(1,0) + strict convexity of log λ(1, w) at 0
Rational trajectories         | UDE + SG + An_s(1,0) + SMVL   | UDE + SG + An_{s,w}(1,0) + UEVL + strict convexity of log λ(s, w) at (1, 0)

Figure 3: Properties of the transfer operator useful for analyzing trajectories.
transformer. Quite general weighted transfer operators were introduced by Ruelle, in connection with thermodynamic formalism (see e.g. [34]). Here, for a given digit-cost c, the transfer operator H_{1,w} depends on a parameter w and is defined by

so that H_{1,0} = H_1. When c is of MG-type, the operator is well-defined and analytic with respect to w when w is near 0 [Property An_w(1,0)]. Then, perturbation theory [24] entails that the two properties UDE and SG extend to the operator H_{1,w}, with a dominant eigenvalue λ(1, w) and a spectral gap. If the digit-cost is non-constant, the pressure function Λ(1, w) := log λ(1, w) is furthermore strictly convex. Finally (see Figure 3), for non-constant digit-costs of class MG, this leads to a proof that, for any probability with smooth density, the asymptotic distribution of the total cost is Gaussian. The dominant terms of the expectation and the variance involve the first derivatives of the function Λ(1, w).

Theorem 1. For a triple of EMG-type with non-constant c and any initial probability on I with a C¹ density, there are μ(c) > 0 and δ(c) > 0 so that, for any n and any Y ∈ R,

Furthermore, for any θ satisfying θ > r_1 (where r_1 is the subdominant spectral radius of H_1), the mean and the variance satisfy

This Central Limit Theorem (CLT) is quite well-known: for published results, including sometimes the Local Limit Theorem, see for instance [10] or [7] for interval maps, and [1] for a more abstract framework. The proof that we give in Section 3 is based on a compact and versatile Quasi-Powers statement of Hwang (see [23]). It encapsulates the consequences of the two main tools
usually applied to get the CLT with speed of convergence: the Lévy continuity theorem for characteristic functions and the Berry-Esseen inequality. Hwang's Theorem includes useful asymptotic expressions (with error terms) for the expectation and the variance. We shall also apply this theorem in our analysis of the Euclidean algorithms.
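For the standard algorithm (Gauss map), the density transformer and its weighted perturbation discussed above take a familiar explicit form; we record it here as a sketch in our own notation (the paper's general definition, over the branch set H, is the one referred to as (2.5)/(3.16)).

```latex
% Sketch for the standard continued-fraction map T(x) = 1/x - floor(1/x):
\[
  \mathbf{H}[f](x) \;=\; \sum_{m\ge 1} \frac{1}{(m+x)^{2}}\, f\!\Bigl(\frac{1}{m+x}\Bigr),
  \qquad
  \mathbf{H}_{s,w}[f](x) \;=\; \sum_{m\ge 1} \frac{e^{w\,c(m)}}{(m+x)^{2s}}\, f\!\Bigl(\frac{1}{m+x}\Bigr),
\]
% so that H_{1,0} = H recovers the density transformer, the parameter s "marks"
% the size (via |h'(x)|^s), and w "marks" the accumulated digit-cost c.
```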
2.3 Continued fractions of rational numbers: Average-case analysis of Euclidean Algorithms. As already seen, rational trajectories reach 0 in finitely many steps and describe the execution of the Euclidean algorithms. They have been less extensively studied. The above-mentioned results about real inputs do not say anything about the set of rational inputs, since the rationals have zero Lebesgue measure. As is often the case, the discrete problem is more difficult than its continuous counterpart. However, this problem is of central importance, since it corresponds to the analysis of the Euclidean algorithms: an execution of a Euclidean algorithm on the input (u, v) gives rise to a rational trajectory T(u/v) which ends at 0, so that the analysis of the algorithm depends on the probabilistic analysis of such rational trajectories. Note that the reference parameter is no longer the truncation degree n, but the size v := max(u, v) of the input (u, v). We study here the total cost related to an execution of the Euclidean algorithm. The most basic cost is the number of iterations P(u, v), which corresponds to the trivial digit-cost c = 1, but we are interested in general total costs C relative to digit-costs c of moderate growth. Now, the trajectory of u/v is described by (2.2) and the total cost C(u, v) is of the form

The average-case complexity of Euclidean Algorithms is by now well understood. For the basic total cost P, the results were obtained around 1969, independently by Heilbronn and Dixon for algorithm G, and by Rieger for K. Recently, Vallée [41, 38, 40] used another perturbation of the density transformer, the transfer operator H_s, together with its weighted version H_{s,w} defined as (2.10)

to analyze Euclidean Algorithms. Note first that operators similar to H_s appear in the study of discrete-time dynamics over continuous spaces: with s real, they are related, via the Bowen formula, to the Hausdorff dimension of invariant sets (see e.g. [35]). But they are also used in continuous-time dynamics (flows) over continuous spaces, with s complex related to the parameter of the Fourier transform of the continuous-time correlation function (see e.g. [30, 14]). The study of H_s for s with large imaginary part is indeed related to the speed of decay of correlations for Anosov flows, a very difficult problem only solved recently [14].

Using the operator H_s in the analysis of Euclidean algorithms leads to a "dynamical analysis" approach that provides a unifying framework for the average-case analysis in the EMG-setting and shows that the rational trajectories have an expected behavior similar to what is observed on truncated real trajectories. More precisely (see [41]), the expectation E_N[P] of the number of steps P on rational trajectories whose inputs have numerator and denominator less than N is asymptotic to μ log N, where μ = 2/|λ'(1)|, and λ(s) is the dominant eigenvalue of the operator H_s. More generally, for a triple of EMG-type, the mean value E_N[C] of the total cost C(u/v) of the trajectory T(u/v) satisfies

with μ(c) the asymptotic mean value (2.6) of real (truncated) trajectories.

We now explain the role of the parameter s in the operator, discussing briefly the proof of the above claim. For the discrete algorithms, the relevant parameters are described by generating functions, a common tool in the average-case study of algorithms [17]. As is usual in number-theoretic contexts, the generating functions are Dirichlet series S(s) where the parameter s "marks" the size v of the input (u, v). It turns out that S(2s) can be (easily) expressed with the "quasi-inverse" (I - H_s)^{-1} of the transfer operator, together with the weighted operator H_s. Hence the asymptotic extraction of the coefficients of S(s) can be achieved by means of Tauberian theorems [13, 36], which yield the desired estimates on mean values of the total costs related to rational trajectories. To check that the hypotheses of the Tauberian theorems are satisfied, analyticity of the operator H_s at s = 1 [Property An_s(1,0)], in conjunction with UDE and SG, is useful, together with Property SMVL, which requires that the spectral radius of the transfer operator H_s admits a strict maximum on the vertical line ℜ(s) = 1 at s = 1 (see Figures 2 and 3).

2.4 Euclidean Algorithms: Distributional analysis and dynamical methods. In view of the average-case results (and the distributional results for truncated real trajectories) just presented, it is natural
to ask whether the distribution of the total cost C(u/v) on rational trajectories with 0 < v ≤ N is also asymptotically Gaussian. In a sense, distributional analyses are obtained by (uniform) perturbation of average-case analyses. Following [17], parameters of combinatorial or number-theoretic structures, provided they remain "simple" enough, lead to local deformations (via an auxiliary variable w) of the functional relations defining univariate generating functions. Under fairly general conditions, such deformations are amenable to perturbation theory and admit of uniform expansions near singularities. In this paper, we apply dynamical methods for the first time to the distributional analysis of (discrete) algorithms. This leads to a new facet of the domain: distributional dynamical analysis.

We now need uniform estimates of coefficients of bivariate series S(s, w) with respect to the parameter w, which "marks" the cost. While Tauberian Theorems sufficed for average-case analyses, they prove insufficient for distributional analyses, since they do not provide (uniform) remainder terms. We then need a more precise "extractor" of coefficients, and the Perron formula is well suited to this purpose. A sufficient condition for the applicability of Perron's formula is the "uniform" Property UEVL on the quasi-inverse of the transfer operator H_{s,w},
which is a generalization of all previously introduced transfer operators [see (2.5), (2.7), (2.10); the weighted operator of (2.10) is just the derivative of H_{s,w} at w = 0]. We shall establish below (4.25) a simple relation between S(2s, w) and the quasi-inverse (I - H_{s,w})^{-1} of the two-variable operator H_{s,w},
where the transfer operator F_{s,w} is relative to the final set F used in the last step of the algorithm.
2.5 Condition UNI and Property UEVL.
Property UEVL. It expresses that, when (s, w) is near the reference point (1, 0) with |ℑs| > t_0, for some t_0 > 0, a uniform power-law estimate |ℑs|^ξ, with ξ small, holds on the quasi-inverse (I - H_{s,w})^{-1} (see Figure 2). Note that the work of Mayer [28, 29] relative to the standard Euclidean algorithm shows that there is α > 0 for which the quasi-inverse (I - H_s)^{-1} is analytic for ℜs > 1 - α and s ≠ 1. However, this property does not suffice, and we actually need UEVL. We adapt here powerful methods due to Dolgopyat [14] and we show that the two-variable transfer operator satisfies UEVL.

In the spirit of Chernov [9], Dolgopyat [14] introduced several "uniform nonintegrability" (UNI) conditions on the dynamical foliations. Then, he proved that the UNI Condition implies the UEVL Property in the case of one-variable transfer operators H_s which are related to dynamical systems with a finite number of branches. We have to adapt Dolgopyat's result to our context. We first give a new formulation of the strongest of Dolgopyat's UNI Conditions, only implicit in Section 5 of [14], which expresses that the derivatives of inverse branches of the dynamical system (I, T) are "not too close", except on a set of "small measure".

Condition UNI. We introduce a "distance" Δ between two inverse branches h and k of the same depth,
and, for h in H^n and η > 0, we denote by J(h, η) the union of the fundamental intervals k(I), where k ∈ H^n ranges over the Δ-ball of center h and radius η:

Recalling that ρ is the contraction ratio defined in (2.3), for any a, 0 < a < 1, Condition UNI expresses that the Lebesgue measure of J(h, ρ^{an}) (for h ∈ H^n) is O(ρ^{an}), with a uniform O-term (with respect to h, n, a).
We first prove that all the systems of E satisfy Condition UNI. This is due to the good properties of their dual systems (see Lemma 7 in [4]). Then we have to modify Dolgopyat's arguments which prove that UNI implies UEVL: we must consider dynamical systems which may possess an infinite number of branches, and we work with bivariate weighted transfer operators H_{s,w} involving a cost function. We finally prove the following theorem, which is the central functional-analytic result of the paper and provides Property UEVL.
Theorem 2 [Dolgopyat-type estimates]. Consider a triple of EMG-type, with contraction ratio ρ, and let H_{s,w} be its weighted transfer operator. For any ξ > 0, there is a real neighborhood Σ_1 × W_1 of (1, 0), and there is a constant M > 0, such that, for all s = σ + it, w = ν + iτ with (σ, ν) ∈ Σ_1 × W_1 and |t| > 1/ρ²,

Dolgopyat was interested in the decay of correlations relative to flows. Later on, Pollicott and Sharp used Dolgopyat's bounds together with Perron's formula to find error terms in some asymptotic estimates for geodesic flows on surfaces of variable negative curvature (see e.g. [31]). Note that they work with univariate Dirichlet series with positive coefficients, whereas we need uniform estimates on coefficients of bivariate Dirichlet series with complex coefficients. To the best of our knowledge, the present paper is the first instance where these powerful tools are applied to distributional analyses in discrete combinatorics. This paper can also be viewed in essence as a realization of Hensley's prophecy in [20]: "This suggests that [our results] might be sharpened considerably. Perhaps, what is needed is a deeper understanding of the function λ(s) and the [transfer] operators".

2.6 Our main results. With Dolgopyat estimates and Perron's formula, we obtain the following Central Limit Theorem (see Section 4 for a description of the main steps of the proof).

Theorem 3 [Central Limit Theorem for costs]. Consider a Euclidean Algorithm of the class E. Denote by Ω_N the set of pairs (u, v) with u/v ∈ I and 0 < v ≤ N, and by Ω̃_N the set of coprime pairs of Ω_N. For any cost c of moderate growth, there is a > 0 so that:

(a) The distribution of the total cost C on Ω_N, Ω̃_N is asymptotically Gaussian, with speed of convergence O(1/√(log N)), i.e., there exist two constants μ(c) > 0 and δ(c) > 0 such that, for any N and any y ∈ R,

(b) The mean and the variance satisfy E_N[C] = μ(c) log N + η(c) + O(N^{-a}), V_N[C] = δ²(c) log N + δ_1(c) + O(N^{-a}).

(c) In the special case c = 1, the constants μ := μ(1) and δ² := δ²(1) involve the first two derivatives of the pressure function Λ(s):

In particular, 2/μ is the entropy of (T, f_1 dx). In the general case, μ(c) = μ · μ̂(c), and

where μ̂(c), δ̂(c) are the constants of Theorem 1, and χ(c) = Λ''(1, 0). Note that a depends a priori on the cost c (remark that this was not the case for θ in Theorem 1). The constant χ(c) can be viewed as a covariance coefficient between the number of steps P and the cost c. Since there exists a closed form for the invariant density f_1 for the systems of E, the constants μ, μ̂(c), and thus μ(c) can be easily computed (see [41]). On the other hand, the constants δ̂(c), δ, δ(c) do not seem to admit a closed form. However, Lhote has proved that they can be computed in polynomial time [27].

With similar methods, we then obtain the following Local Limit Theorem for any lattice cost of moderate growth (see Section 4 for a description of the main steps of the proof). Note that the LLT was previously obtained by Hensley [20], in the particular case of the Standard Euclidean Algorithm and of the trivial cost c = 1, and with a worse speed of convergence. His proof used the transfer operator, but in a way appreciably different from ours: Hensley approximates discrete measures on rationals by continuous measures, and he obtains distributional results on rational trajectories by approximating them by truncated real trajectories. In particular, he avoids dealing with parameters s of large imaginary parts.

Theorem 4 [Local Limit Theorem for lattice costs]. For any algorithm among G, K, O, and any lattice cost c of span L > 0 (i.e., a cost c whose values belong to a subset LN) and of moderate growth, and letting μ(c) > 0 and δ²(c) > 0 be the constants from Theorem 3, the following holds,
with an O-term uniform for x ∈ R. The same holds for Ω̃_N.
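To fix ideas, the Gaussian estimate asserted in Theorem 3(a) has the following shape; this display is our reconstruction from the surrounding prose, since the theorem's displayed formula did not survive extraction, and the exact normalization and constants are those of the theorem.

```latex
% Sketch of the central limit estimate of Theorem 3(a), uniformly in y:
\[
  \mathbb{P}_{\Omega_N}\!\left[
     \frac{C(u,v) - \mu(c)\,\log N}{\delta(c)\,\sqrt{\log N}} \;\le\; y
  \right]
  \;=\;
  \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{y} e^{-t^{2}/2}\,dt
  \;+\; O\!\left(\frac{1}{\sqrt{\log N}}\right).
\]
```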
3 Dynamical Analysis of Continued Fraction Expansions of Real Numbers
In this section, we introduce the weighted transfer operator (§3.1) and recall its main properties (§3.2). §3.3 explains (as a useful and exemplary parenthesis) the fundamental role this operator plays in the distributional analysis of truncated real trajectories. Theorem 1 is proved using Hwang's Quasi-Power result.
In particular, the quasi-inverse (/ — written (formally) as (3.18)
can be
We shall recall next some classical properties of the weighted transfer operator.
3.2 Summary of classical spectral properties of transfer operators (For more precisions, see e.g. [3, 7, 10, 12, 37]). Let HS,™ be the transfer operator (3.16) associated to a £MQ triple. Let S0, W0 be the real sets from (3.15). When (5Rs,3?w) belongs to EQ x WQ, the operator HSiU, acts boundedly on the Banach space Cl ( 2") endowed with the norm || • || i defined by II /Hi = SUP I/I + SUP I /'I- It depends analytically on (s,w). When w = 0, we omit the second index in the operator and its associated objects. (a) Spectral Dominant Properties and Analyticity. For real parameters (s, w) € EO x WQ, the operator VLs,w is quasi-compact *, satisfies UDE and SG. Analyticity Property, together with perturbation theory, proves We consider a digit-cost of .M£-type, i.e., a strictly that the dominant eigenvalue A(s,w), the dominant positive digit-cost c defined on 7i which satisfies c(m) = eigenvector fSiW, the subdominant spectral radius rSjU, O(logm). Then the series are well-defined and analytic when (s, w} is near the real axes. (6) Pressure function. For (cr, i/) e EQ x WQ, the pressure function is defined as A( (1/2), J/Q 0. 3.1 Transfer operators of interval maps. If J is endowed with an initial probability density with respect to Lebesgue measure / = /o, repeated applications of the map T modify the density, and the flow of successive densities /i, / 2 , • • - , / « , • • • describes the global evolution of the system at time t = 0,1,2, . . . , n , — The operator H such that f i — H[/o] and more generally fn = H[/ n _i] = Hn[/0] for all n is called the density transformer, or the Perron-Frobenius operator. It is defined as
In the sequel, we study properties of trajectories in the ^A-l^-setting. We extend the digit cost c to a cost function, also denoted c, on H* by
and we introduce the two-variable transfer operator H S)W which extends the transfer operator, via HI,O = H. It depends on two (complex) parameters s and w, (3.16)
The operator FSiW is defined in a similar way, with T replacing H. Note that the additive property of costs (together with the multiplicative property of the derivatives) entails (3.17)
and A'(l ) is the opposite of the entropy of the dynamical system (T, ^i) while AJ U (1,0) is the average of the cost (with respect to /i), equal to ju(c) [see (2.6)]. (c) Strict convexity of the pressure. The pressure is strictly convex in s, i.e., A"2(cr, v) > 0. Also A(a+gi/, ^) is strictly convex as a function of v for all fixed q ^ 0 and fixed cr. If c is not constant, the pressure is strictly convex in w, i.e., A^2(o", v) > 0. (d) Function a. There is a complex neighborhood W of 0 and a unique function a : W —» C such that X(cr(w),w) = 1; this function is analytic near 0, and cr(0) = l. (e) Condition UNI and aperiodicity. Since the systems of the class S satisfy the UNI Condition, they fulfill l An operator is quasi-compact if the upper part of its spectrum is formed with isolated eigenvalues with finite multiplicity
177
aperiodicity conditions: First, for any t ^ 0, 1 does with a uniform O-term for w e W. In other words, not belong to SpHi+jt.o (this is a weak version of the the moment generating function Mn(w] behaves as a SMVL Condition). Second, if c is a lattice cost with " Quasi- Power ", and we are in a position to apply the span L, for all t, and all r not multiple of 27T/L, 1 does following result, in order to prove Theorem 1: not belong to SpHi+if.,^. (See Proposition 1 in [4]).
Hwang's Quasi- Power Theorem [23]. Assume that 3.3 Transfer operators and real trajectories. the moment generating functions relative to a sequence We now describe the fundamental role played by the of functions CM on probability spaces fiyv are analytic operator Hi iUJ in the analysis of truncated real trajec- in a complex neighborhood W of 0, where they satisfy
tories. We consider any triple of SMQ type, with a non-constant cost c. The interval X is endowed with a EN[exp(wCN)} = exp[j3NU(w) + V(w}} (l + O(«^)) , probability measure with smooth density /, and we are interested in the asymptotic behavior of the distribution with 0tt, KN —> oo as N —> oo, and U(w), V(w) analytic on W. Assume also that [/"(O) ^ 0. Then, the mean of and the variance satisfy when the truncation degree n tends to oo. [Recall that the /ij's (1 < i < n) are the first n LFTs used by the trajectory T(x).\ As already mentioned, a very convenient tool is the moment generating function Mn(w) of the cost C^n\ defined by
totically Gaussian, with speed of convergence O(K^ +
/r1/2^)• PN
For our application to Theorem 1, we set fin .= (I, f dx) for all n, Cn = C< n) , /3n = n, Kn = 0~n. Function U equals the pressure function w >—> A(l,w;) and With the change of variables y = h(u), we can rewrite V(w) = log (J J Pi )to [/](it)du). Note that, when c is not constant, the function A(l,w;) is strictly convex at Mn(w) as: w = 0, (see Section 3.2) and [/"(O) ^ 0. Note that /i is explicitly given in Figure 1 for Q, O, and /C, so that fi(c) is computable in these cases. The constants /i(c) and 6(c) will appear in Theorem 3. A Local Limit Theorem and a Large Deviations Property hold in this context (see [7, 1]). 4 Dynamical Analysis of Continued Fraction The above relation is fundamental for analysing costs Expansions of Rational Numbers on truncated real trajectories: From Section 3.2 (a), (recalling that r\ is the subdominant spectral radius), In this section, we turn to rational inputs and Euclidean for any Q > r\ , there is a complex neighborhood W of 0 Algorithms. We introduce in §4.1 Dirichlet series of moment generating functions, which we relate to the quasion which the operator Hi iiu splits as inverse of the weighted transfer operator. We explain (in §4.2) how to use Perron's Formula to extract coefficients of Dirichlet series and why Property UEVL is where PI,™ is the projector on the dominant eigensub- needed for this. Next, in §4.3 we explain why this leads space for the eigenvalue A(l,ty), and Ni ilu has spectral to a uniform quasi-power approximation for the moment radius TI,W < 0|A(l,iu)|. Therefore, for any n > 1 generating function of costs. Theorem 3 follows, applying again Hwang's Quasi-Power theorem, and provides the asymptotic Gaussian distribution for costs of moderate growth on rational trajectories. Finally, we focus which entails in §4.4 on lattice costs, and describe how an application of the saddle point method entails our local limit theorem, Theorem 4.
178
4.1 Dirichlet series and transfer operators. We with a, 6, c, d coprime integers, with determinant ad — restrict now our study to rational inputs x = (u/v) € I be = ±1, and denominator D[h] related to \h'\ through in the SM^-setting. We will see, with (4.26), that it is sufficient to consider the sets the transfer operator can be alternatively defined by as the possible inputs. Precisely, we deal with the probabilistic model relative to ft/v endowed with uniform probability P/vAn execution of the Euclidean algorithm on an input Our tool to study the distribution of the total cost (u, v) of Q performing p steps uniquely decomposes the C7(tt, u), associated to a digit-cost c and defined in (2.9), rational (u/v) in a unique way as is its moment generating function on Jljy,
with hi £H, I < i W(N) is the cumulative value of exp(tuC) on Euclidean algorithm defines a map (u, v) >-» h which is QTV, a bijection between the sets fi and H* x F. In view of (4.21) (4.24,4.22,3.18), the relations
To analyse the moment generating function EN [exp(wC)] of the cost C on Qjy, it then suffices to estimate the functions $W(N), asymptotically in N —» co, and uniformly in w in a complex neighborhood of 0. Extending the principles defined in [38, 40, 41], we replace the sequence of moment generating functions by a single bivariate Dirichlet series, henceforth called the Dirichlet-moment generating function: (4.22)
Then, the relation
shows that the moment generating function (4.20) is related to partial sums of the coefficients of series S(s,w). As we previously did in Section 3.3 for truncated real trajectories, we aim to relate the moment generating function of costs on rational trajectories to the weighted transfer operator. Since each inverse branch of any depth is a linear fractional transformation
provide the desired expression for the Dirichlet moment generating function S(s, w) in terms of the transfer operators HS)W and "Fs,w-
With the set fi of all pairs (u,v) with (u/v) € 2, we may define a Dirichlet moment generating function S(s,w). Since each element (u',v') can be written in a unique way as (du,dv) with coprime (u,v), and C(u, v) = C(u',v'), one has
Using well-known properties of the Riemann zeta function C( s )> all our results for fijv will follow from those on £lpf. In view of (4.20, 4.22, 4.23), the relations (4.25), (4.26) connecting the Dirichlet moment generating function with the transfer operator are the analogues for rational trajectories of the relation (3.19) for the truncated real trajectories. In the case of rational trajectories, we have to work with the quasi-inverse and extract the coefficients of Dirichlet series. This is why the discrete problem, which we wish to solve here, is more difficult to solve than the continuous problem. The sequel of the Section motivates and summarizes informally the main steps of our work.
179
4.2 Perron's formula and Dolgopyat's estimates. We wish to evaluate the sum $W(N) of the first N coefficients of the Dirichlet series S(2s,w). Our first main tool towards this goal is the Perron Formula of order two (see [15]), which is valid for a Dirichlet series F(s) = X)n>i ann~s and a vertical line 9fts = D > 0 inside the domain of convergence of F,
4.3 Asymptotic normality and Quasi-Power estimates. Perron's Formula (4.27) combined with the fundamental relation (4.25), together with Property UEVL will provide the following estimate for the Cesaro sum ^u, of $,„, as T —> oo,
Applying it to the Dirichlet series S(2s, w), we find
where E(w) is the residue of S(2s,w) at the pole s = a(iu), and the O-term is uniform on W when T —> oo. Note that a and E are analytic on a complex neighborhood of w = 0. It does not seem easy to transfer the information (4.28) on ^W(T) to estimates on $ W (T); we proceed in three steps to prove asymptotic normality of the cost C: First Step. We introduced a smoothed model, i.e., the probabilistic model (fJ;v,Pjv) defined as follows: For any integer N, set fiyv = fiyv; next, choose uniformly an integer Q between N — [Nl~2a\ and AT, and draw uniformly an element (u,u) of Qg. Slightly abusing language, we refer to the function C in the model ($IN,PN) as the "smoothed cost." Now, we appeal to a classical result that is often used in number theory contexts (see Lemma 10 in [4]), and we then deduce from (4.28) the following quasi-power estimates for the moment generating function of the "smoothed" version of the cost C,
Thus, Perron's formula gives us information on which is just a Cesaro sum of the $W(Q):
We shall explain in the next subsection below how to transfer information from $fw to supw(3fcr(iy))It is next natural to modify the integration contour 3?s = D into a contour containing a(w] as a unique where the O-term is uniform in w. pole of S(1s,w}. This is possible (for a quadrilateral Second Step. Now, we are again in the framework of contour) if there is a > 0 and a neighborhood W of 0 Hwang's Theorem, and we get that the smoothed verfor which the following is true: sion of cost C follows an asymptotic Gaussian distri(i) for all w € W, the Dirichlet series S(2s, w) admits bution, with a speed of convergence in O(l/v/loglV), s = a(w) as a unique pole in the strip |5?s — 1| < a. together with precise informations about the asymptotic behavior of the expectation E^[C] and the vari(iz) On the left vertical line 3?s = 1 — a, the series ance Vw[C]. The function U of Hwang's Theorem is S(2s, w) is O(|5s|0, w^h a small f, and a uniform equal to U(w) = 2cr(w) where a is defined in Section O-term (with respect to w G W), so that it is 3.2 (d). possible to control the integral (4.27) on the left Third Step. We prove that the distributions of C on vertical line 5Rs = 1 — a, in a uniform way with OAT and on fi/i/ are 7V~2a-close, i.e., for all (u,v) € OAT, respect to w. These two properties are clear consequences of Property UEVL. And Theorem 2 proves that Property UEVL so that the distribution of C on fijy is also asymptotically Gaussian, with a speed of convergence in holds in the £jVf£-setting.
180
O(l/vTog77). The closeness of distributions, together with the polynomial worst-case complexity of the algorithms of E now provides precise information about the asymptotic behavior of the expectation EN [C] and the variance V^[C]. The constants that appear in the main terns of the expectation and the variance are equal respectively to
This ends our sketch of the proof of Theorem 3. 4.4 Local Limit Theorem and Saddle Point Method. We summarize now the proof of Theorem 4. It deals with any lattice cost c of moderate growth. It is sufficient to consider costs C relative to costs c with span L = 1, in the smoothed model, and our starting point is the relation
(z) The Dirichlet series S(2s, ir) is analytic in the strip (n) On the left vertical line 5Rs = 1 - /?, the series 5(25, ir) is O(|Ss|^), with a small £, and a uniform O-term (with respect to r], so that it is possible to control the integral (4.27) on the left vertical line 3te = 1 - /?. Then, using the Perron formula as in 4.3 proves that that the second part /„ is a O(AT~ 7 ) for some 7 > 0. This ends our sketch of the proof of Theorem 4.
5 Conclusions This article has presented a unified approach to a large body of results relative to Euclid's algorithm and its major (fast) variants. We recapitulate here some of its consequences, relations to the vast existing literature on the subject, as well as extensions and open questions. It should be stressed that most of the improvements can eventually be traced to the existence of pole-free Since we are looking for an LLT result, the convenient strips for Dirichlet series of cost parameters, a fact that scale of the problem is n := log TV, and we set precisely devolves from our extension of Dolgopyat's estimates to continued fraction systems. First our methods lead to extremely precise estimates (Here [•] denotes the nearest integer function.) We of the moments of costs, and in particular the number of steps, a much studied subject. Our estimates, when consider specialized to the mean number of steps, yield
Our strategy is to decompose the integration interval [—7r,+7r] into a neighborhood of zero, [—v,v], and its complement \r\ € (v,?r]. This gives rise to two integrals In and In • Integral 40). The estimate (4.29), together with Relations (4.30), proves that the main part of the integrand exp[-iTqx(n)]-EN[eirC}is
Since the function z i-» cr(z) — 1 — z&'(Q) has a saddle point at z = 0, the saddle point method entails the following estimate for the integral In ,
Integral /„ . On the other hand, when |r| € [u, TT], Dolgopyat's estimates and aperiodicity results (see (e) of 3.2), entail the following: For some /3 > 0,
see above Theorem 3, Parts (6) and (c). In the case of the standard algorithm, this covers the original estimates of Dixon and Heilbronn in 1969-1970 (the main term), as well as Porter's 1975 improvement (the second term, 77), while providing the right shape of the error term [O(JV~T)], for which Porter further showed that one could take 7 = (1/6) — e in the case of the standard algorithm. We refer the reader to the accounts by Knuth [25] and Finch [16] for more material on this classical topic. Our formula (5.32) also extends Rieger's analyses (first published around 1980, see [32,33]) of the centered algorithm and the odd algorithm. Note that, in our perspective, the second-order constant 77 comes out as a spectral quantity. It is an open problem to derive an explicit form starting from our expressions (e.g., such a form involving C'(2) is known under the name of Porter's constant in the standard case). In sharp contrast to the mean value case, variances do not seem to be amenable to elementary methods. The first-order estimate has been given by Hensley (1994) in the paper [20] that first brought functional analysis to the field. Our formula,
181
stated in Theorem 3 can be viewed as a precise form of Hensley's estimate relative to the standard algorithm, one that also extends to the odd and centered algorithms. (Incidentally, the quantity 5, called Hensley's constant [16], is not recognized to be related to classical constants of analysis, though it is now known to be polynomial-time computable thanks to a recent study of Lhote [27]; the nature of 61 is even more obscure.) Note that the complex-analytic properties of the moment generating functions provided by the functionalanalytic methods furnish similarly strong estimates for moments of arbitrarily high order (our Theorem 3), a fact which also appears to be new. Regarding distributional results, several points are worthy of note. Dixon in his 1970 paper had already obtained exponential tail bounds and these were further improved, albeit in a still fairly qualitative way by Hensley in his 1994 study (see his Theorem 1 in [20]). Our approach gives much more precise information on the distribution, as we now explain. For simplicity, let us specialize once more the discussion to the number of steps. First, regarding the central region of the Local Limit Theorem (Theorem 4), the nature of the error terms obtained (O(N~^°)) and the fact that the saddle point method lends itself to the derivation of full asymptotic expansions (see, e.g., Henrici's book [19]) entail the existence of a full asymptotic expansion associated with the Gaussian approximation of Theorem 4, namely,
where
involves a computable numeric sequence {e/}, (the argument of PAT being that of Theorem 4). This expansion is valid for x in any compact set of R. Regarding large deviations, Lemma 11 (quasi-powers for smoothed costs) implies that E/v[exp(wC1)] is of the form of a quasi-power, provided w stays in a small enough fixed neighbourhood of 0. By a standard argument (originally due to Cramer and adapted to the quasi-powers framework by Hwang [21, 22, 23]) and Lemma 14, this implies the existence of a large deviation rate function, that is, of a function I(y) such that, for the right tail corresponding to y > 0, there holds (5.35)
by yp.\ogN is exponentially small in the scale of the problem, being of the form .A simplified version of (5.35) is then: the probability of observing a value at z or more standard deviations from the mean is bounded by a quantity that decays like e-Ciz (Hensley's notations [20]). Analogous properties naturally hold for the left tail. Cramer's technique of "shifting the mean" and use of the resulting "shifted" quasi-powers lead in fact to very precise quantitative versions of (5.35) in the style of Hwang—thereby providing optimal forms of tail inequalities d la DixonHensley. Similar properties hold for other costs measures, like the ones detailed in the introduction. Of particular interest is the statistics of the number of digits assuming a particular value, for which estimates parallel to (5.32), (5.33), (5.34), and (5.35) are also seen to hold. For instance, the frequency of occurrence of digit m in the expansion of a rational number has mean
and it exhibits Gaussian fluctuations. It is again unknown to us, which of the involved constant (beyond the mean value factor) may be related to classical constants of analysis. The spectral forms of our Lemma 12 may at least provide a starting point for such investigations. A major challenge is to derive distributional information that are so to speak "superlocal" . By this, we mean the problem of estimating the behaviour of the number of steps and other cost measures over fractions of denominator exactly equal to N, i.e., rationals belonging toft;v\fiw-i- ^n v*ew °f wnat is known in the averagecase [16, 25], we expect arithmetical properties of N to come into the picture. The analytical difficulty in this case stems from the fact that a further level of "unsmoothing" (how to go from ftjv to fiyy \ fijv-i?) would be required.
Further works. Our dynamical approach provides a general framework, where it seems possible to answer other questions about distributional analysis. For instance, all the questions that we solve for rational trajectories can be asked for the periodic trajectories. A periodic trajectory is produced by a quadratic number, and the reference parameter is related to the length of the geodesies on the modular surface. In a forthcoming paper, and following the same principles as in [30, 31, 39], we prove that the distribution of costs on periodic trawith I(y) defined on an interval [0,5] for some 5 > 0. jectories can be studied in a similar way as here, replac1 In other words, the probability of exceeding the mean ing the quasi-inverse (/ by det(/ — H SiW ).
182
Open problems. We also ask various questions about Euclidean algorithms: for instance, what happens for other Euclidean algorithms of the Fast Class (in particular for the Binary algorithm [6, 38])? The extension of our results to cost functions that are still "small" but take into account the boolean cost (also known as the bit complexity) of each arithmetic operation is on our agenda. Note that an average-case analysis is already known to be possible via operator techniques [2, 41]. On another register, the extension to "large" costs is likely to lead us to the realm of stable laws: see for instance Gouezel and Vardi's works [18, 42] for occurrences of these laws in continued fraction related matters.
[8] CESARATTO, E., AND VALLEE, B. Hausdorff dimension of sets defined by constraints on digits related to large additive costs, Les Cahiers du GREYC, 2003, submitted. [9] CHERNOV, N. Markov approximations and decay of correlations for Anosov flows, Ann. of Math. (2) 147 (1998) 269-324. [10] COLLET, P., Some ergodic properties of maps of the interval, Dynamical systems, Proceedings of the first UNESCO CIMPA School on Dynamical and Disordered Systems (Temuco, Chile, 1991), Hermann, 1996. [11] COLLET, P., AND ECKMANN, J.-P., Iterated Maps on the Interval as Dynamical Systems, Birkhauser, Boston, 1990. [12] CHAZAL, F., MAUME, V., Statistical properties of General Markov dynamical sources: applications to Acknowledgements. During his thesis, Herve Daude information theory, To appear in DMTCS, 2004. made experiments providing evidence for the Gaussian [13] DELANGE, H., Generalisation du Theoreme d'Ikehara, limit property of the number of steps. Joint work Ann. Sc. ENS, 71 (1954) pp. 213-242. with Eda Cesaratto [8] involved an extensive use of [14] DOLGOPYAT, D., On decay of correlations in Anosov the weighted transfer operator, and some of the ideas flows, Ann. of Math., 147 (1998) pp. 357-390. that we shared on that occasion proved very useful [15] ELLISON, W. AND ELLISON, F., Prime Numbers, Hermann, Paris, 1985. for the present paper. We have had many stimulating discussions with Philippe Flajolet regarding the saddle- [16] FINCH, S. R., Mathematical Constants, Cambridge University Press, 2003. point method and the notion of smoothed costs. Finally, [17] FLAJOLET, P. AND SEDGEWICK, R., Analytic Combiwe thank Sebastien Gouezel who found a mistake in a natorics, Book in preparation, see also the Rapports previous version. This work has been supported in part de Recherche INRIA 1888, 2026, 2376, 2956. by two CNRS MATHSTIC grants and by the European [18] GOUEZEL, S. Central limit theorem and stable laws Union under the Future and Emerging Technologies for intermittent maps, Prob. Theory and Applications programme of the Fifth Framework ( ALCOM-FT Project (2003), to appear. IST-1999-14186). [19] HENRICI, P., Applied and Computational Complex Analysis, vol. 2, John Wiley, 1974. [20] HENSLEY, D., The number of steps in the Euclidean References algorithm, Journal of Number Theory, 49, 2 (1994) pp. 142-182. [1] AARONSON, J., AND DENKER, M., Local limit theo[21] HWANG, H.-K., Theoremes limite pour les structures rems for partial sums of stationary sequences generated combinatoires et les fonctions arithmetiques, PhD by Gibbs-Markov maps, Stock. Dyn. I (2001) 193-237 thesis, Ecole Polytechnique, Dec. 1994. [2] AKHAVI, A. AND VALLEE, B., Average bit-complexity of Euclidean Algorithms, Proceedings ICALP'OO, Lec- [22] HWANG, H.-K., Large deviations for combinatorial distributions: I. Central limit theorems, The Annals ture Notes Comp. Science 1853, 373-387, Springer, of Applied Probability 6 (1996) 297-319. 2000. [3] BALADI, V., Positive Transfer operators and decay of [23] HWANG, H.-K., On convergence rates in the central limit theorems for combinatorial structures, European correlations, Advanced Series in non linear dynamics, Journal of Combinatorics, 19 (1998) pp. 329-343. vol 16, World Scientific, 2000. [4] BALADI, V. AND VALLEE, B. Euclidean Algorithms are [24] KATO, T., Perturbation Theory for Linear Operators, Springer-Verlag, 1980. Gaussian, available from the ArXiv [5] BOYARSKY, A. 
AND GORA, P., Laws of Chaos, In- [25] KNUTH, D.E., The Art of Computer programming, Volume 2, 3rd edition, Addison Wesley, Reading, variant Measures and Dynamical Systems in One DiMassachusetts, 1998. mension, Probability and its applications, Birkhauser, [26] LASOTA, A. AND MACKEY, M., Chaos, Fractals 1997. and Noise; Stochastic Aspects of Dynamics, Applied [6] BRENT, R.P. Analysis of the Binary Euclidean alMathematical Science 97, Springer, 1994. gorithm, Algorithms and Complexity, New Directions and Recent Results, J.F. Traub, ed., Academic Press, [27] LHOTE, L., Computation of a Class of Continued Fraction Constants, hese Proceedings. 1976. [7] BROISE, A., Transformations dilatantes de 1'intervalle [28] MAYER, D. H., On the thermodynamic formalism for the Gauss map, Comm. Math. Phys., 130 (1990) et theoremes limites, Asterisque 238, pp. 5-109, Societe pp. 311-333. Mathematique de France , 1996.
183
[29] MAYER, D. H., The thermodynamic formalism approach to Selberg's zeta function for PSL(2, Z), Bull. Amer. Math. Soc., 25 (1991) pp. 55-60. [30] PARRY, W., AND POLLICOTT, M., Zeta Functions and the Periodic Orbit Structure of Hyperbolic Dynamics, Asterisque, 187-188, 1990. [31] POLLICOTT, M., AND SHARP, R. Exponential error terms for growth functions on negatively curved surfaces, Amer. J. Math., 120 (1998) pp. 1019-1042. [32] RIEGER, G. J., Uber die mittlere Schrittanzahl bei Divisionalgorithmen, Math. Nachr. (1978) 157-180. [33] RIEGER, G. J.,, Uber die Schrittanzahl beim Algorithmus von Harris und dem nach nachsten Ganzen, Archiv der Mathematik 34 (1980), 421-427. [34] RUELLE, D., Thermodynamic formalism, Addison Wesley, 1978. [35] RUELLE, D., Repellers for real analytic maps, Ergodic Theory Dynamical Systems, 2 (1982) 99-10. [36] TENENBAUM, G., Introduction to analytic and probabilistic number theory, Cambridge University Press, 1995. [37] VALLEE, B., Dynamical sources in information theory: fundamental intervals and word prefixes, Algorithmica, 29 (2001) pp 262-306. [38] VALLEE, B., Dynamics of the Binary Euclidean Algorithm: Functional Analysis and Operators, Algorithmica, (1998) 22 (4) pp. 660-685. [39] VALLEE, B., Dynamique des fractions continues a contraintes periodiques, Journal of Number Theory (1998) 2, 183-235. [40] VALLEE, B., Dynamical Analysis of a Class of Euclidean Algorithms, Theoretical Computer Science vol 297/1-3 (2003) pp 447-486 [41] VALLEE, B., Digits and Continuants in Euclidean Algorithms. Ergodic Versus Tauberian Theorems, Journal de Theorie des Nombres de Bordeaux, 12 (2000) pp. 531-570. [42] VARDI, I., A relation between Dedekind sums and Kloosterman sums, Duke Math. J. 55 (1987), 189-197.
184
A Simple Primality Test and the rth Smallest Prime Factor Daniel Panario*
Bruce Hichmond and Martha Yip
Abstract We analyze the classical primality test from high school. This analysis requires information on the first smallest prime factor of an integer. We also provide asymptotic formulas for the average value of the logarithm of the rth smallest prime factor, averaged over the first N integers. Our results are analogous to the ones for the rth smallest size of components in random decomposable combinatorial structures and extend works by de Bruijn and others. The results also show the strong connection between decomposition of combinatorial objects and prime decomposition of integer numbers. 1
Introduction
The smallest prime factor of the first N integers has been intensively studied, first by Buchstab [4] and de Bruijn [6]. Tenenbaum [21] gives a recent discussion of this problem. In this paper we use de Bruijn's results to analyze the classical primality test from high school. We also follow de Bruijn's analysis to extend his results to the study of the logarithm of the rth smallest prime factor. There is an analogy between prime divisors of the first N integers and the components of combinatorial structures of size N. This analogy was first pointed out by Knuth & Trabb-Pardo [17] for permutations. It has been examined by Arratia, Barbour & Tavare* [2] and their co-authors from a probabilistic standpoint; see their upcoming book. The analogy between additive and multiplicative number system is discussed in the book by Burris [5]. For example, the number of prime factors has been largely studied since Hardy and Ramanujan, who showed that almost all integers n have log log n prime factors. Erdos & Kac [10] establish a Gaussian limit distribution for the number of prime factors with mean asymptotic to log log n and standard deviation asymptotic to i/loglogn. Flajolet & Soria [12, 13] obtained a Gaussian limit distribution for the number of components in random combinatorial 'School of Mathematics and Statistics, Carleton University, Ottawa, Canada; email: [email protected] * Department of Combinatorics and Optimization, University of Waterloo, Waterloo, Canada; email: {Ibrichmond,m2yip}@math.uwaterloo.ca
structures with mean asymptotic to logn and standard deviation asymptotic to i/log n. In the case of largest prime factor the analogy is that the average value of the logarithm of the fcth largest prime factor is ~ (7* logn, for a constant (7*. Knuth & Trabb-Pardo point out that the fcth largest cycle of a permutation on n symbols has mean asymptotic to Cfcn. They also state that the fcth largest irreducible factor of a polynomial of degree n over a finite field has mean asymptotic to C^n. Gourdon [14] shows under general conditions, that the fcth largest component of a random combinatorial structure is ~ C^n. The (7*8 are indeed equal so we can "get" the answer for prune divisors by replacing n by logn. Indeed, Gourdon shows that the probability that the largest component of an object of size n is of size at most m is ~ p(n/m), where p(.) is the Dickman function (see [7, 9]). These combinatorial structures include objects such as permutations on n objects and polynomials over finite fields of degree n. Knuth & Trabb-Pardo show that the probability that the largest prime factor is at most m is ~ /9(logn/logm). Hildebrand [16] has investigated the probability that an integer less than or equal to n has largest prime factor at most equal to m. He shows that his results hold for a large range of m if and only if the Riemann hypothesis holds. It would be interesting to determine an analogue of Hildebrand's results for combinatorial objects. In Section 2 we analyze the most classical primality test: given an integer n successively check whether positive numbers divide n. We do not claim that this algorithm is practical, but it is the most simple primality test algorithm and it does not seem to be analyzed before. Using de Bruijn's estimates for the smallest prime factor of an integer we show that the expected cost of the algorithm is essentially logn log logn with a large deviation from this mean. This is in agreement with the behavior of Ben-Or's irreducibility test for polynomials over finite fields [18] that has a similar structure. Another of our goals here is to show that an analogy holds between the smallest prime factors of integers and the smallest components of combinatorial objects. In the case of the smallest component of a random
185
combinatorial structure, Panario & Richmond [19] show that under general conditions, the probability that the smallest component of a combinatorial object of size n is of size at least m is ~ " ^ > for n,m -> oo, where u)(.) is the Buchstab function. It is known that the probability, of the integers less than or equal to n, having smallest prime factor at least m is ~ u/Oogn/togm)-,!!/!^ Panario & Richmond also show that the average value of the rth smallest component of a combinatorial object of size n is ~ e ^°gn) , where 7 is Euler's constant. So, if the analogy referred to above holds, we expect that the average value of the logarithm of the rth smallest prime factor, averaged over the first n integers is ~ e"y(lo^,i°sn^. In Section 3 we prove that indeed this is the case. Experimental results comparing the exact values and our estimates are shown in Section 4. Let pr(n) denote the rth smallest prime factor of n where n = p\p% • • -p*, and pi < p? < • • • < Pk- Properties of pr(n) have been widely studied in number theory. In particular, Erdos and Tenenbaum [11] provide asymptotic estimates for the probability pr(n) = m. De Koninck and Tenenbaum [8] give estimates for the probability pr(n) > m. Indeed, they give an asymptotic formula for the second iterate of the logarithm of the median value of pr(n). In this paper we focus on the logarithm of pr(n). We follow de Bruijn's residue approach. It is a special case of a technique used in various analytic combinatorics and analysis of algorithms papers. This is done for combinatorial problems by Panario & Richmond [19] using singularity analysis. Our asymptotic estimates may not be the best possible. However, we wish to emphasize here the similarity not only of the results but also of the methods used in analytic combinatorics and number theory. 2 Analysis of a Simple Primality Test We consider the high school method to test if an integer number is prime or not. The algorithm works as follows: given an integer n, check successively whether numbers divide n. This is done for 2 and then for each odd number bigger than 2, in increasing order, until we find the smallest prime factor of n. If we need to check all odd numbers up to \/n, then n is prime. Each step of this algorithm involves a trial division between n and a number m < ^/n. Using classical arithmetic l this operation has cost A (log n log m — (logm)2) for some constant A that depends on the implementation (see [3], p. 43). I
For simplicity we consider only classical arithmetic. Similar analysis could be done for fast arithmetic.
186
The expected cost of this algorithm is given by
where P\p\(n) > m] is the probability that Pi(n), the smallest prime factor of n, is bigger than or equal to m, and C(n div m) denotes the cost of diving n by m. In the previous sum and in similar following sums, m ranges over 2 and then over odd integers bigger or equal than 3. THEOREM 1. The average cost of the primality test is asymptotic to
for a constant A that depends on the implementation. PROOF. Let $(t) denote the probability that p\ (n) > t. If t < ne(n) where e(n) = O(logloglogn/loglogn), then (see Theorem 3 in page 400 of [21]) (2.1)
where 7 is Euler's constant. It can be deduced $(t) = O(l/logt) for 2
when / is positive and g, h are monotone (see page 57 of [15], for example). Thus,
The probability that P[pi(n) < t] is 1 - $(£). So the expected value of logpi(n) is the Stieltjes integral of log£ with respect to the distribution of 1 — $(£)• The first sum above equals
We can use Equation 2.2 to justify replacing $(£) by its asymptotic expansion, which is given by Equation 2.1.
Also, if $(£) is differentiable we can replace d$(t) by when n is a prime because we have divided by all odd integers up to \/n. This is asymptotic to A-^/nlogn/4, $'(£) (see page 55 of [15]), and we have a huge contrast to the average cost. We note that, by the same arguments in [18], C is a small constant; indeed, C is very close to ~7 This is because du is close to zero since u}(u) — e~7 oscillates about zero (see [21]). In principle, we could also analyze, in a similar way, a simple integer factoring algorithm already studied by Also Knuth & Trabb-Pardo [17]. Given an integer n, the algorithm finds the prime factor of n by starting from 2 and checking successively whether odd numbers divide n. When a prime factor is found, n is divided by that prime. Assuming n has r prime factors, let {ti, . . . , tr} be the r prime factors of n in increasing order. Then the expected cost of factoring n is given by (for simplicity we assume that n is squarefree) Furthermore, using the second moment of the logarithm of Pi(n), we have
The expected cost of the algorithm is asymptotic to Ae~'y log n log log n. In order to compute this cost we would only need It is interesting to determine the asymptotic behav- the above conditional probabilities. We remark that ior of the variance as well as the mean. It is known Knuth & Trabb-Pardo approach is rather different. They obtain their analysis by finding an estimate for (see [21]) that the largest* prime factor of n that gives the stopping condition for the algorithm. where u(u} is the Buchstab function. Panario and Richmond [18] found that the probability that the smallest degree irreducible factor of a random polynomial of degree n over a finite field be at least m is asymptotic to u;(n/m)/m, and that the expected value of m2 is asymptotic to Cn where C = 2 /2°° ^ - du. Essentially the same argument in [18], but changing n by logn and m by log*, shows that the expected value of (logpi(n))2 is C log n. Hence, there is a large deviation from the mean. This large deviation is expected given the log logn expectation and the known fact that most integers have small primes as their factors. Moreover, a large deviation is in agreement with the known facts about decomposable combinatorial structures [19]. The worst-case is
3 The rth Smallest Prime Factor The goal of this section is to give an estimate for the size of the rth smallest prime factor of an integer. We do that by combining a simple extension of de Bruijn's results [6] with Panario and Richmond's [19] study of smallest components in random decomposable structures (but applied to Dirichlet generating functions instead of combinatorial generating functions). This study provides another instance of the analogy between the decomposition of integer numbers and of combinatorial objects. The novelty here is that this analogy holds not only in terms of the estimates obtained but also in terms of the methods of proof employed. We use the following notation where p runs through
187
the prime numbers:
where, as usual, [vi]F(v] means the coefficient of vj in F(v). Let I$(s) denote the Dirichlet generating function of integers with pr (n) > m, that is,
than m is the coefficient of u* 1 in the bivariate expression in the statement of the lemma. This is completely analogous to the first lemma in the section on unlabelled objects of Panario and Richmond (see [19]). That Dirichlet series and power series are essentially equivalent enumeratively has been pointed out by Burris [5] (see Appendix B). The generating function in which every prime factor 1S at least m is
Let denote the number of positive integers < x with pr(n) > m The union of the cases for 1 < i < r completes the proof. In this paper, u and um denote different things. The notation um is standard in combinatorial enumeration for unlabelled objects. Furthermore, we define u = log x/ log m as de Bruijn does, and as it is stated in tne next theorem. We need Lemma 3.12 of Titchmarsh [22].
The implied constants in our O-terms are absolute, and so they are uniform in m. Gourdon [14] and then Panario and Richmond [19] used generating functions analogous to those above. They use ordinary and exponential generating functions to enumerate labelled j ,, „ , ,. , . , ,. , .., , . ,. ([221) y/ Let us write s = a + it and and unlabelled combinatorial objects with restrictions LEMMA 3.2. u on the largest and smallest components. We view the integer p?*p£2 • • -p"r as having QI components pi, etc. With this interpretation, standard enumeration techniques give the following lemma. where fn = O(ip(n)), tp(n) being non- decreasing, and as for a positive, we have LEMMA 3.1. The Dirichlet generating function L$(s) 48
Then, ifa>l,o~ + a>l,xis not an integer, N is the
nearest nteger 1 to x, and |t| < T for real a number T we have where
is defined in Equation 3.3.
PROOF. Consider the r cases (for simplicity we suppress nmpr(n)):
We can construct the objects in Ci by first choosing i — 1 components of size less than m, then attaching components of any size at least m. The generating function of objects having i — 1 components of size smaller
188
Let us write
for
where
and put
Since a > 1, then to (m) < g1(m)= g(m). In the O-terms appearing in the proof of Lemma 3.12inTitchmarsh[22] we may estimate p0(m) by g(m). so 1/2 < 6 < 1. The values of T and A are specified We can let s = 0 since if a > 1 then a + a > 1. Then, as in de Bruijn's paper. Using the residue theorem, the we have integral for U$(x) in Equation 3.4 can be written as We can also let x — N + 1/2. Moreover since s = 0 and a>1, then we have where
Our study of , l
follows, very closely, from de
Bruijn [6]. We assume that the reader has a copy of de Bruijn's paper and we only outline the differences. The complete details of our proof can be found in [20]. Choosing a = 1 and a = 1 + (logs)-1, we can apply Lemma 3.12 of Titchmarsh [22] and immediately get corresponding to Equation 3.2 of de Bruijn [6], with g(m) =
We treat Ji, J% and Js much as de Bruijn does to obtain equajs to
The difference is less than the sum of the O-terms in
We have the following result.
THEOREM 2. Let u = logs/ logm, and let 7 be Euler's Following de Bruijn's proof closely one now obtain constant *^e theorem (see [20] for details). 7/e2 < m < oo, then The main result of this section is the following. THEOREM 3. The expected value of the logarithm of the rth smallest prime factor over the first n integers is asymptotic to
n or(jer
^0 prove Theorem 3 we need the following
Iemma
PROOF (SKETCH). Let us suppose that x > m > e2, and u = LEMMA 3.3. Let $(t) be the probability that the rth log x/ logm or u > 1. Let the positive numbers T, A,b smallest prime factor of n is greater than or equal to be such that t. Then
189
PROOF OF LEMMA. Let pr(n] denote the rth smallest Combining the two results we get prime factor of n. Consider the r cases:
If u = log(n)/log(m), this implies lim
we can apply theroem 2. For values of t in this range, we have
It is known that the number of integers up to n with smallest prime factor p\ > m is (see Tenenbaum [21], Corollary 1.4.6.1). This is case C\. The number of integers smaller or equal to n of The expected value of logp (n) is given by log t d(l — the form p\ •••p/t_i(n/pi ---pk-i) in the case Ck for $(£)). Using again the factr that when / is positive and 2 < k < r is 5, h are monotone, then / f ( t ) dg(t) = O (/ /(t) dh(£)) if g(t) = O(h(t)) (see [1]), we have
Summing up the cases, it follows that
So similarly
Dividing the above by n proves our lemma. We observe that asymptotically the first r — 1 cases are not significant, and that the only case that matters is the last one. PROOF OF THEOREM 3. Suppose we have
Thus, gathering the above results,
where h(n) ->• oo slowly as n -> oo. This implies
Taking the logarithm of both sides gives lose n log ^— > log log logn 4- logh(n) - log log log log n. logm
190
We conclude this section giving a possible improvement to our asymptotic estimates. For simplicity, let us suppose r — 2. As we have seen, the cases C\ and £2 arise, and the case C\ is asymptotically insignificant. In the case £2 we can write an integer < x as p(x/p) where
p < m and x/p has smallest factor > ra. The number of these integers is
Table 1: Probability that n has pi > na, 0 < a < 1/2 0.99900 0.81279 0.49900 a = 0.15 0.99000 0.54186 0.81279 0.49000 0.33200 a = 0.20 0.60959 0.40639 0.26500 0.32000 a = 0.25 0.48767 0.32511 0.32000 0.22700 a = 0.30 0.40639 0.27093 0.25000 0.20600 a = 0.35 0.23222 0.34834 0.18900 0.25000 a = 0.40 0.20319 0.30479 0.16900 0.21000 a = 0.45 0.18062 0.27093 0.15700 0.21000 a = 0.50 0.16255 0.24383 First row denotes actual value.
a = 0.10
We cite now Theorem 3 in page 400 of [21]. THEOREM 4. ([21]) Uniformly forx>m>2we have
where u>(u) denotes the Buchstab function. Hence, asymptotically, we have
10a
10*
n
0.99000
10«
105
10"
0.33332 0.49990 0.33333 0.48767 0.60959 0.40639 0.26665 0.33320 0.22857 0.32511 0.40639 0.27093 0.22856 0.19180 0.26650 0.20319 0.24383 0.30479 0.18052 0.22840 0.15283 0.16255 0.19507 0.24383 0.15304 0.13176 0.19160 0.13546 0.20319 0.16255 0.16330 0.13622 0.11507 0.11611 0.17417 0.13933 0.09841 0.11830 0.14690 0.15239 0.10159 0.12191 0.10286 0.08506 0.12810 0.10837 0.13546 0.09031 0.09527 0.07833 0.12040 0.12191 0.08127 0.09753 Second row denotes approxima
Table2: Number of integers < n with p1 > m n m m—=12
If logm = o(logaj) then logp = o(logx) and log (x/p)/logm -> co. So since U;(M) -> e~"1 if u -> oo [21], this is asymptotic to
m =3 m =5 m =7 m = 11
m = 31
If logm S> o(logx) then we have
10*
103
QQ 99
QOQ 999
100 49 50 32 33 25 26.7 21 22.9
1000 499 500 332 333 265 267 227 229 159 158
10" QQQQ 9999
10000 4999 5000 3332 3333 2665 2667 2284 2285
1573 1579
10° QQQQQ 99999
100000 49999 50000 33332 33333 26665 26667 22856 22857 15804 15794
10" QQQQQQ 999999
1000000 499999 500000 333332 333333 266665 266667 228570 228570 157938 157947
First row denotes actual value. Second row denotes approximate value calculated with nf
and
The analysis of the other r's is very similar with more cases. As we have seen, the dominant case is Cr. 4 Experimental Results In Tables 1 and 4 we give the probability that n has Pr > na, 0 < a < 1/2, for r = 1,2. We compare the correct values with respect to our approximations. We also provide tables of number of integers smaller or equal to n with pr > m, for several values of m and r = 1, 2 (Tables 2, 3, 5 and 6). We use numerical approximations for r > 2 instead of our asymptotic approximations. In the following we explain why numerical approximations provide good estimates in these experiments. If we wish to find an asymptotic formula for the number of integers < n with pr > m, we can consider the r cases:
When we derive our asymptotic formula, we discard the first r— 1 cases since they are asymptotically insignificant. However, when we calculate values numerically, the first r — 1 cases do contribute significantly. This can be illustrated with a simple example. Let |5| = |{n:n < x, n € 5}|. Consider the case r = 2. We have
191
Table 3: Number of integers < n with p\>m (cont.) n m = 100
10*
10d
10" 1204 1203 1219
m = 313 m = 1000
First row denotes actual value. value calculated with
105 11830 12032 12192 9530 9683 9771
10e 120759 120317 121919 94002 96829 97710 78330 80965 81280 Second row denotes approximate
Third row denotes approxi-
mate value calculated with
Table 4: Probability that n has P2 > na, 0 < a < 1/2
10J 104 10° 10* 0.74000 0.83100 0.62700 0.51518 0.27382 0.51210 0.55945 0.55638 a = 0.15 0.74000 0.58100 0.48810 0.44630 0.51210 0.56111 0.53774 0.50274 a = 0.20 . 0.49000 0.44200 0.41930 0.40149 0.55945 0.53774 0.49099 0.44720 a = 0.25 0.35000 0.37400 0.37440 0.33787 0.55638 0.50274 0.44720 0.40129 a = 0.30 0.35000 0.32900 0.32630 0.29555 0.53774 0.46835 0.40972 0.36405 a = 0.35 0.29000 0.30400 0.28480 0.26697 0.51462 0.43724 0.37803 0.33352 a = 0.40 0.29000 0.28600 0.26490 0.24243 0.49099 0.40972 0.35113 0.30811 a = 0.45 0.24000 0.26300 0.24570 0.22647 0.46835 0.38547 0.32807 0.28664 a = 0.50 0.24000 0.23300 0.23000 0.21233 0.44720 0.36405 0.30811 0.26825 First row denotes actual value. Second row denotes
n a = 0.10
10° 0.53261 0.53774 0.41891 0.46835 0.37110 0.40972 0.31375 0.36405 0.27827 0.32807 0.24635 0.29910 0.22095 0.27528 0.20665 0.25533 0.19583 0.23836 approximate
Hence,
Table 5: Number of integers < n with p2>m
10J 831 633 581 m =3 547 442 m =5 491 374 m=7 444 329 m = 11 428 159 m = 31 352 First row denotes actual value calculated with
n
m =2
10' 74 63 49 54 35 49 29 44 24 42.8
10" 8770 6334 6270 5470 4881 4919 4193 4441 3744 4284 1573 3528 value.
10" 10° 90407 921501 63348 633487 65407 671501 54702 547023 51518 532612 49196 491961 44630 463724 444194 44419 40149 418918 42847 428478 15804 157938 35281 352810 Second row denotes approximate
Table 6: Number of integers < n with p2 > rn (cont.) n m = 100
10*
10J
10* 2300 3040 3081
m = 313 m = 1000
First row denotes actual value. value calculated with
106 24243 30406 30811 21298 26613 26855
10° 257021 304063 308112 215663 266137 268558 195839 237442 238364 Second row denotes approximate Third row
denotes approximate value calculated with
e 7, hence the disagreement in our estimations. Thus, for 0.35 < a < 0.5, these numerical results can be better approximated using computed values of the Buchstab function when it does not behave like e~7. Some numerical approximations and a graph of the Buchstab function can be found in [18].
Acknowledgment. The first and second authors were Compare this to our asymptotic formula for r = 2 funded by NSERC. The third author was supported by an NSERC Undergraduate Student Research Award. whichis References We note that 1 is significant compared to log logm when n (and hence m) is "small". Thus in this example, C\ contributes significantly. Finally, we comment about an interesting feature of Tables 2, 3, 5 and 6. For small values of m with respect to n (Tables 2 and 5), the approximations practically match the correct values. However, for large values of m with respect to n (Tables 3 and 6), the approximations are no longer as good. When n is fixed, say n = 106, and m grows, the correct values involve the Buchstab function w(logn/ logm) where the argument is small. It is well-known that for those arguments the Buchstab function does not behave like
192
[1] APOSTOL, T. Mathematical Analysis. Addison Wesley Longman, Inc., 2nd Ed., 1974. [2] ARRATIA, R., BARBOUR, A., AND TAVARE, S. Random combinatorial structures and prime factorizations. Notices of AMS 44 (1997), 903-910. [3] BACH, E., AND SHALLIT, J. Algorithmic Number Theory, vol I. MIT Press, 1996. [4] BUCHSTAB, A. Asymptotic estimates of a general number theoretic function. Mat. Sbornik 44 (1937), 1239-1246. [5] BURRIS, S. Number theoretic density and logical limit laws. AMS, Mathematical Surveys and Monographs, vol. 86, 2001.
[6] DE BRUIJN, N. On the number of uncancelled elements in the sieve of Eratosthenes. Indag. Math. 12 (1950), 247-256. [7] DE BRUIJN, N. On the number of positive integers < x and free of prime factors > y. Indag. Math. 13 (1951), 2-12. [8] DE KONINCK, J.M., AND TENENBAUM, G. Sur la loi
de repartition du fc-ieme facteur premier d'un entier. Math. Proc. Cambridge Phil. Soc. 133 (2002), 191-204. [9] DICKMAN, K. On the frequency of numbers containing prime factors of a certain relative magnitude. Ark. Mat. Astr. Fys. 22 (1930), 1-14. [10] ERDOS, P., AND KAC, M. The Gaussian law of errors in the theory of additive number theoretic functions. American Journal of Mathematics 62 (1940), 738-742. [11] ERDOS, P., AND TENENBAUM, G. Sur les densites de certain suites d'entiers. Proc. London Math. Soc. 59 (1989), 417-438. [12] FLAJOLET, P., AND SORIA, M. Gaussian limiting distributions for the number of components in combinatorial structures. Journal of Combinatorial Theory, Series A 53 (1990), 165-182. [13] FLAJOLET, P., AND SORIA, M. General combinatorial schemas: Gaussian limiting distributions and exponential tails. Discrete Mathematics 114 (1993), 159-180. [14] GOURDON, X. Combinatoire, algorithmique et geometric des polynomes. PhD thesis, Ecole Polytechnique, 1996. [15] GREENE, D.H., AND KNUTH, D.E. Mathematics for the analysis of algorithms. Edition 3, Birkhauser, 1990. [16] HILDEBRAND, A. Integers free of large prime factors and the Riemann hypothesis. Mathematika 31 (1984), 258-271. [17] KNUTH, D.E., AND TRABB-PARDO, L. Analysis of a simple factorization algorithm. Theor. Computer Science 3 (1976), 321-348. [18] PANARIO, D., AND RICHMOND, B. Analysis of BenOr's polynomial irreducibility test. Random Structures and Algorithms 13 (1998), 439-456. [19] PANARIO, D., AND RICHMOND, B. Smallest components in decomposable structures: Exp-log class. Algorithmica 29 (2001), 205-226. [20] PANARIO, D., RICHMOND, B, AND YIP, M. The rth smallest prime factor and a simple primality test. CORR 2002-25, Combinatorics and Optimization Research Reports, University of Waterloo, 2002. [21] TENENBAUM, G. Introduction to analytic and probabilistic number theory. Cambridge University Press, 1995. [22] TITCHMARSH, E.G. The Theory of the Riemann Zeta Function. Oxford University Press, 1951.
193
Gap-free samples of geometric random variables - Extended Abstract

Pawel Hitczenko* and Arnold Knopfmacher†

Abstract. We study the probability that a sample of independent, identically distributed random variables with a geometric distribution is gap-free, that is, that the sizes of the variables in the sample form an interval. We indicate that this problem is closely related to the asymptotic probability that a random composition of an integer n is likewise gap-free.

1 Introduction.

We study samples of independent, identically distributed random variables with a geometric distribution. Specifically, let Γ1, Γ2, Γ3, ... be independent identically distributed geometric random variables with parameter p, that is, P(Γi = j) = p q^{j-1}, j = 1, 2, ..., with p + q = 1. We will be interested in the probability that a random sample of n such variables is gap-free, that is, that the sizes of the variables in the sample form an interval. In addition, if the interval starts at 1, the sample is said to be complete. We can restrict our attention to the probability pn that a sample of length n is complete, since the probability that a geometric sample of length n has no ones is exponentially small. In the case p = 1/2 this probability turns out to be exactly 1/2. The case of p ≠ 1/2 is more interesting. In fact, in that case the sequence (pn) does not have a limit, but exhibits small oscillations. An asymptotic expression for pn in the case of p ≠ 1/2 is derived in Section 2.

Some of the previous studies relating to the combinatorics of geometric random variables are as follows. In [12] the number of left-to-right maxima was investigated in the model of words (strings) a1 . . . an, where the letters ai ∈ N are independently generated according to the geometric distribution. H.-K. Hwang and his collaborators obtained further results about this limiting behaviour in [2]. The two parameters 'value' and 'position' of the rth left-to-right maximum for geometric random variables were considered in a subsequent paper [9]. Other combinatorial questions have been considered by Prodinger in e.g. [13, 14]. The combinatorics of geometric random variables has gained importance because of its applications in computer science. We mention just two areas: skiplists [3, 11, 15] and probabilistic counting [4, 8]. The special case p = 1/2 of geometric random variables is closely related to compositions of n, as shown in [5, 6]. This led us to consider the same question for compositions. A composition of a natural number n is said to be gap-free if the part sizes occurring in it form an interval. In addition, if the interval starts at 1, the composition is said to be complete.

Example. Of the 32 compositions of n = 6, there are 21 gap-free compositions, arising from permuting the order of the parts of the partitions

6, 3 + 3, 3 + 2 + 1, 2 + 2 + 2, 2 + 2 + 1 + 1, 2 + 1 + 1 + 1 + 1, 1 + 1 + 1 + 1 + 1 + 1,

and 18 complete compositions, arising from permuting the order of the parts in

3 + 2 + 1, 2 + 2 + 1 + 1, 2 + 1 + 1 + 1 + 1, 1 + 1 + 1 + 1 + 1 + 1.

In the full version of this paper we will show that the proportion of gap-free or of complete compositions of n is

by reducing the compositions problem to the special case p = 1/2 of geometric random variables.

*The first author is supported in part by the NSA grant MSPF02G-043. The research was conducted during a visit to the John Knopfmacher Centre for Applicable Analysis and Number Theory at the University of the Witwatersrand, Johannesburg, South Africa. He would like to thank the Centre for the invitation and for financial support. The material of the second author is based upon work supported by the National Research Foundation under grant number 2053740.
Department of Mathematics, Drexel University, Philadelphia, PA 19104, U.S.A.
†The John Knopfmacher Centre for Applicable Analysis and Number Theory, Department of Computational and Applied Mathematics, University of the Witwatersrand, P. O. Wits, 2050 Johannesburg, South Africa.

2 Geometric samples.
Consider Γ = (Γ1, Γ2, . . . , Γn), a sample of n i.i.d. geometric random variables with parameter p. Let
pn = P(Γ ∈ C) be the probability that (Γ1, . . . , Γn) is complete. To obtain a recurrence relation we condition on the number of Γi's that are equal to 1. Since being complete implies that there is at least one 1 among the values of the Γi's, by the law of total probability we find that
We now observe that, given that j out of n Γk's are one, the sample is complete if and only if the remaining n − j variables take on all the values between 2 and their maximum. This is the same as to say that the geometric sample (Γk − 1) of length n − j is complete, given that all n − j of them are at least 2. But, by the memoryless property of geometric random variables, the conditional distribution of Γk − 1 given that Γk ≥ 2 is just that of Γk. Since those of the Γk's that are at least 2 remain independent, this just means that
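As a quick sanity check on the quantity just defined, the completeness probability pn can also be estimated by direct simulation. The following sketch is not part of the paper; the sample lengths, trial counts and seed are arbitrary choices.

```python
# Monte Carlo estimate of p_n = P(a geometric(p) sample of length n is complete),
# i.e. the set of observed values equals {1, 2, ..., max}.  Illustrative only.
import math
import random

def geometric(rng, p):
    """Inverse-transform sample of P(Gamma = j) = p * (1 - p)**(j - 1), j >= 1."""
    u = 1.0 - rng.random()          # uniform on (0, 1]
    return 1 + int(math.log(u) / math.log(1.0 - p))

def is_complete(sample):
    values = set(sample)
    return values == set(range(1, max(values) + 1))

def estimate_pn(n, p, trials=200_000, seed=0):
    rng = random.Random(seed)
    hits = sum(is_complete([geometric(rng, p) for _ in range(n)]) for _ in range(trials))
    return hits / trials

for n in (10, 50, 200):
    print(n, estimate_pn(n, 0.5), estimate_pn(n, 1 / 3))
# For p = 1/2 the estimates stay near 1/2; for p = 1/3 they drift slowly with n,
# consistent with the small oscillations discussed below.
```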
Since

P(exactly j of Γ1, . . . , Γn are equal to 1) = C(n, j) p^j q^{n−j},

substituting these two expressions and changing the order of summation by letting k = n − j, we obtain the following recurrence for the pn's:

(2.1)  p_n = Σ_{k=0}^{n−1} C(n, k) p^{n−k} q^k p_k,   p_0 = 1.

Before analysing the general case let us note that if p = 1/2 then the sequence p_0 = 1 and p_k = 1/2 for k ≥ 1 is a solution. Indeed, proceeding inductively we get

The case of p ≠ 1/2 is more interesting. In fact, in that case the sequence (pn) does not have a limit, but exhibits small oscillations. As illustrated in Figures 1, 2, 3 below, both the period and the amplitude of the oscillations depend on the size of p.

Figure 1: Plot of pn for p = 1/3 and 1 ≤ n ≤ 1000.

Figure 2: Plot of pn for p = 2/3 and 1 ≤ n ≤ 1300.

Figure 3: Plot of pn for p = 0.99 and 1 ≤ n ≤ 1000.

To treat this case we will follow an approach that became quite common in the analysis of certain algorithms (see e.g. numerous examples in [7, 16]). We Poissonize the problem by considering the Poisson transform of the sequence (pn), analyse its asymptotics, and then de-Poissonize to recover the asymptotics of (pn). To carry out this program let P(z), for z complex, be the Poisson transform of (pn). That is,

P(z) = e^{−z} Σ_{n≥0} pn z^n / n!.
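For concreteness, the recurrence (2.1) as reconstructed above can be iterated directly. The short sketch below (not from the paper; the p-values and the cutoff are illustrative) computes pn in exact rational arithmetic and confirms the behaviour just described: pn = 1/2 for all n ≥ 1 when p = 1/2, and slowly varying values otherwise.

```python
# Iterate the recurrence p_n = sum_{k=0}^{n-1} C(n,k) p^(n-k) q^k p_k, p_0 = 1,
# in exact rational arithmetic.  The recurrence is the one derived in the text.
from fractions import Fraction
from math import comb

def completeness_probabilities(p, N):
    p = Fraction(p)
    q = 1 - p
    probs = [Fraction(1)]                      # p_0 = 1
    for n in range(1, N + 1):
        probs.append(sum(comb(n, k) * p**(n - k) * q**k * probs[k]
                         for k in range(n)))
    return probs

print([float(x) for x in completeness_probabilities(Fraction(1, 2), 8)])
# -> [1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
print(float(completeness_probabilities(Fraction(1, 3), 60)[60]))
```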
Hence P(z) satisfies the following functional equation

(2.3)  P(z) = P(qz) − T(z),

where T(z) = e^{−z} Σ_{n≥1} pn (qz)^n / n!, and since it follows from (2.1) and the binomial formula that 0 ≤ pn ≤ 1, the series converges absolutely for every z. Moreover, for x ∈ R, T(x) = O(x) as x → 0+ and T(x) has exponential decay as x → ∞. Thus the Mellin transform of T(x) exists in the strip (−1, ∞) := {s ∈ C : −1 < ℜ(s) < ∞}. By direct iteration of (2.3) we obtain, for every m ≥ 0,

P(z) = P(q^m z) − Σ_{j=0}^{m−1} T(q^j z),

and by passing to a limit with m,

P(z) = 1 − Σ_{j=0}^{∞} T(q^j z).

Letting Q(z) = P(z) − 1 = −Σ_{j=0}^{∞} T(q^j z) and taking the Mellin transform we obtain

Q*(s) = −T*(s) Σ_{j≥0} q^{−js} = T*(s) / (q^{−s} − 1),

provided that series converges. Since this happens for ℜ(s) < 0, Q*(s) will exist in the strip (−1, 0). Inverting the Mellin transform yields

Q(x) = (1 / 2πi) ∫_{c−i∞}^{c+i∞} Q*(s) x^{−s} ds

for any −1 < c < 0. This integral can be evaluated with the aid of the residue theorem. Now, x^{−s} Q*(s) = x^{−s} T*(s) / (q^{−s} − 1) has simple poles whenever q^{−s} = 1, i.e. at χ_k = 2kπi / ln(1/q), k = 0, ±1, ±2, . . . , with corresponding residues

The main term comes from k = 0 and the remaining residues will contribute oscillatory terms of relatively small amplitude. In order to complete the proof we will need to de-Poissonize this result. Once this is done we will conclude that

The values T*(χ_k) are given by

T*(χ_k) = Σ_{n≥1} pn q^n Γ(n + χ_k) / n!.

Since the series converge geometrically fast, they can be evaluated numerically with the aid of (2.1). For example, setting k = 0 gives the main term

Its plot as a function of p is given in Figure 4.

Figure 4: Plot of the non-oscillating limit term for pn for 0 < p < 1.

To de-Poissonize we use the fact that P(z) satisfies (2.3), which is a special case of [16, Theorem 10.5] (see also [7]) with γ1(z) = 0, γ2(z) = 1, and t(z) = −T(z). Thus we need to verify conditions (10.28)-(10.32) of [16]. But this is straightforward: (10.28) holds for any β > 0, and (10.29) holds as well, since

which is bounded by 1 provided ℜ(z)/|z| is large enough, which holds in a cone. (10.30) is trivial and (10.31) holds outside the cone, since for z ∉ S_θ, ℜ(z) ≤ α|z| for some α < 1. Finally, (10.32) is true since

It is interesting to investigate the amplitude of the oscillations on either side of the critical value p = 1/2. As shown in Figure 5, these become very small near to the critical value.

Figure 5: Plot of pn for p = 0.499 and 1 ≤ n ≤ 1000.

For 0 < p < 1/2 the amplitude of the fluctuations increases steadily up until around p = 0.48 and then decreases rapidly to zero.

Figure 6: Plot of the amplitude of the fluctuations for p < 0.5.

For p > 1/2 the amplitude of the fluctuations increases steadily and in general is orders of magnitude larger than for p < 1/2.

Figure 7: Plot of the amplitude of the fluctuations for p > 0.5.
References
[1] G. E. Andrews, The Theory of Partitions, Addison-Wesley, Reading, MA, 1976.
[2] Z.-D. Bai, H.-K. Hwang, and W.-Q. Liang, Normal approximations of the number of records in geometrically distributed random variables, Random Struct. Alg., 13 (1998), pp. 319-334.
[3] L. Devroye, A limit theory for random skip lists, Ann. Appl. Probab., 2 (1992), pp. 597-609.
[4] P. Flajolet and G. N. Martin, Probabilistic counting algorithms for data base applications, J. Comp. Syst. Sci., 31 (1985), pp. 182-209.
[5] P. Hitczenko and C. D. Savage, On the multiplicity of parts in a random composition of a large integer, SIAM J. Discrete Math., to appear, 2004.
[6] P. Hitczenko and G. Louchard, Distinctness of compositions of an integer: a probabilistic analysis, Random Struct. Alg., 19 (2001), pp. 407-437.
[7] P. Jacquet and W. Szpankowski, Analytical de-Poissonization and its applications, Theoret. Comput. Sci., 201 (1998), pp. 1-62.
[8] P. Kirschenhofer and H. Prodinger, On the analysis of probabilistic counting, in Lecture Notes in Mathematics (E. Hlawka, R. F. Tichy, eds.), 1452, pp. 117-120, 1990.
[9] A. Knopfmacher and H. Prodinger, Combinatorics of geometrically distributed random variables: Value and position of the rth left-to-right maximum, Discrete Math., 226 (2001), pp. 255-267.
[10] D. Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching, Addison-Wesley, Reading, MA, 1973.
[11] T. Papadakis, I. Munro, and P. Poblete, Average search and update costs in skip lists, BIT, 32 (1992), pp. 316-332.
[12] H. Prodinger, Combinatorics of geometrically distributed random variables: Left-to-right maxima, Discrete Math., 153 (1996), pp. 253-270.
[13] H. Prodinger, Combinatorics of geometrically distributed random variables: Inversions and a parameter of Knuth, Annals of Combinatorics, 5 (2001), pp. 241-250.
[14] H. Prodinger, Combinatorics of geometrically distributed random variables: Lengths of ascending runs, LATIN 2000, Lecture Notes in Computer Science 1776, pp. 473-482, 2000.
[15] W. Pugh, Skip lists: a probabilistic alternative to balanced trees, Comm. ACM, 33 (1990), pp. 668-676.
[16] W. Szpankowski, Average Case Analysis of Algorithms on Sequences, Wiley, 2001.
Computation of a Class of Continued Fraction Constants

Loick Lhote*

Abstract. We describe a class of algorithms which compute in polynomial time important constants related to the Euclidean Dynamical System. Our algorithms are based on a method which has been previously introduced by Daude, Flajolet and Vallee in [10] and further used in [13, 32]. However, the authors did not prove the correctness of the algorithm and did not provide any complexity bound. Here, we describe a general framework where the DFV-method leads to a proven polynomial-time algorithm that computes "spectral constants" relative to a class of Dynamical Systems. These constants are closely related to eigenvalues of the transfer operator. Since it acts on an infinite-dimensional space, exact spectral computations are almost always impossible, and are replaced by (proven) numerical approximations. The transfer operator can be viewed as an infinite matrix M = (M_{i,j}).
1 Introduction

When mathematical constants do not admit a closed form, it is of great importance to compute them. The book of Finch [12] provides many instances of this situation. Here, we consider a class of constants which are of great interest in the algorithmics of Dynamical Systems. Since they do not seem to admit a closed form, we are interested in their computability: does there exist an efficient algorithm that computes the first d digits of the constants? More precisely, we wish to prove that they are polynomial-time computable. We recall that a constant is said to be polynomial-time computable if its first d digits can be obtained with O(d^r) arithmetic operations. Here, we are interested in the computability of "spectral" constants which are closely related to the spectrum of transfer operators associated to these Dynamical Systems.

*GREYC, University of Caen (France), mail: [email protected]

1.1 The principles of the algorithm. Consider, in the complex plane, a disk D with center x0. Consider an operator G that acts on the space
Then, for f ∈ A∞(D), the Taylor expansions at x0 of f and G[f] exist, and the operator G can be viewed as an infinite matrix M := (M_{i,j}) with 0 ≤ i, j < ∞. The truncated matrix is M_n := (M_{i,j})_{0 ≤ i,j ≤ n}.
Note that the operator π_n ∘ G and the matrix M_n have the same spectrum. In the case of the Euclidean Dynamical System, Daude, Flajolet and Vallee introduced in [10] a method for computing (a finite part of) the spectrum of transfer operators, which they further used in [13, 32]. Their method, the so-called DFV-method, has three main steps, which we describe in an informal way (see Figure 1):
(i) Compute the truncated matrix M_n relative to the operator G.
(ii) Compute the spectrum Sp M_n of the matrix M_n, i.e., the set of its eigenvalues.
(iii) Relate the set Sp M_n with a (finite) part of Sp G.

Figure 1: The DFV-method for computing eigenvalue approximations.

In the case when the transfer operator has a unique dominant eigenvalue λ, isolated from the remainder of the spectrum, one can expect that the same is true for M_n, with a dominant eigenvalue λ_n. Moreover, the authors of [10] observed that the sequence λ_n seems to converge to λ (when the truncation degree n tends to ∞) with exponential speed. They conjectured the following: there exist n_0, K, θ such that, for any n ≥ n_0, one has |λ_n − λ| ≤ K θ^n.

We do not prove this conjecture in this general setting, but we prove that it is the case when (i) we approximate the dominant eigenvalue and (ii) the transfer operator is normal on a convenient functional space. The Euclidean Dynamical System belongs to the SCDS class. It has been deeply studied by Mayer. Adapting his results to our more general setting, we exhibit a class of transfer operators which are normal on convenient Hardy spaces. We then prove that a class of continued fraction constants is polynomial-time computable (Theorem 2), and we are able to exhibit, for each constant of the class, an efficient algorithm which computes d proven digits in time O(d^4). We apply our method to three constants: the Gauss-Kuz'min-Wirsing constant γ_G, the Hensley constant γ_H, and the Hausdorff dimension of reals associated to constrained continued fraction expansions (Theorem 3).

1.3 The Euclidean Dynamical System and its transfer operators. Every real number x ∈ ]0,1] admits a continued fraction expansion of the form

x = 1 / (m_1 + 1 / (m_2 + 1 / (m_3 + ...))),

where the m_i form a sequence of positive integers. Ordinary continued fraction (CF) expansions of real numbers are the result of an iterative process which constitutes the continuous counterpart of the standard Euclidean division algorithm. They can be viewed as trajectories of a specific Dynamical System relative to the Gauss map T : [0,1] → [0,1] defined by

T(x) := 1/x − ⌊1/x⌋ for x ≠ 0,  T(0) := 0

(here, ⌊x⌋ is the integer part of x). The set Q of the inverse branches of T is formed of the linear fractional transformations h_m : x ↦ 1/(m + x), m ≥ 1.
It acts on A∞(D) (for a convenient disk D, see Section 3.1). A perturbation of the density transformer, the transfer operator G_s, defined as

involves a new parameter s. It extends the density transformer, since G_1 = G, and plays a crucial role in the analysis of rational trajectories. It acts on A∞(D) as soon as ℜs > 1/2. Remark that its iterate G_s^n of order n involves the set Q^n of the inverse branches of depth n.

The constrained transfer operator does not involve the whole set Q^n of the inverse branches of some depth n, but only a subset A ⊆ Q^n for some n. It is defined as

This is a powerful tool for studying the reals whose CF-expansion only uses the set A*. It acts on A∞(D_A) (for a convenient disk D_A, see Section 3.1) as soon as ℜs > σ_A for some σ_A which depends on A (note that σ_A = −∞ if A is finite). In the following, if there exists n ≥ 1 for which A = Q^n, the index A will be omitted. Consider the disk D_A and the functional space A∞(D_A); for real s > σ_A, the operator G_{s,A} possesses a unique dominant eigenvalue λ_A(s), positive and isolated from the remainder of the spectrum by a spectral gap ρ_A(s). These two quantities are essential for describing the action of G_{s,A}. Since the transfer operator plays a fundamental role in the analysis of the underlying Dynamical System, this explains why the two quantities λ_A(s), ρ_A(s) intervene in the description of our three algorithmic constants, which we now describe.

1.4 Gauss-Kuz'min-Wirsing constant. Around 1800, Gauss [14] studied the evolution of the distribution of the iterates T^k(x). In fact, he introduced an operator closely related to the density transformer G and he exhibited a density g(x) := (1/log 2)(1 + x)^{−1} (now known as Gauss' density) which he proved to be invariant under the action of T (i.e., G[g] = g). He conjectured that it is a limit density; in other words, he asked whether, for any initial density f, the sequence G^n[f] tends to g. One century later, Kuz'min [24] (1928) and Levy [26] (1929) proved this assertion. It was then important to obtain the optimal speed of convergence of G^n[f] to g. Finally, Babenko [2] and Wirsing [36], around 1975, completely solved the problem and showed that the speed is exponential. The ratio equals the subdominant eigenvalue (which is unique and real) of the density transformer G, and Wirsing proved that this subdominant eigenvalue is real and negative. This constant, called the Gauss-Kuz'min-Wirsing constant and denoted here by γ_G, does not seem to be related to other arithmetical constants [12]. It was computed in [10] to about 30 decimal places by the DFV-method. Using similar methods, Sebah (unpublished) and Briggs [4] (2003) improved the accuracy to respectively 100 and 385 digits. Since we show in this paper that the DFV-method leads to a (proven) algorithm, we exhibit here a polynomial-time algorithm to compute the Gauss-Kuz'min-Wirsing constant.

1.5 Hensley's constant. The Euclid Algorithm is closely related to the Continued Fraction algorithm. Indeed, if a_0 = a_1 m_1 + a_2 is a division performed by the Euclid Algorithm, then the rationals x_0 = a_1/a_0 and x_1 = a_2/a_1 are related by T(x_0) = x_1, and the execution of the Euclid algorithm on (a_0, a_1) is just the trajectory of a_1/a_0 under the action of T. The complexity of the Euclid Algorithm (i.e., the number P of divisions performed) was first studied in the worst case by Lame [25] around 1850. A century later (around 1970), Heilbronn [17] and Dixon [11] determined the average number of steps. Finally, in 1994, Hensley [18], using the transfer operator G_s, showed that the number of steps of the Euclid Algorithm follows asymptotically a Gaussian law. Recently, Baladi and Vallee [3], using deeper results on the transfer operators, obtained an alternative proof of this result that is both more general and more concise. On the set of pairs (u, v) with 0 ≤ u ≤ v ≤ N, the asymptotic expressions for the mean and the variance involve the first and second derivatives of the dominant eigenvalue function λ(s) at s = 1:

The first derivative λ'(1) equals the opposite of the entropy of the Euclidean Dynamical System. Since the invariant density (the Gauss density) is explicit, the value −λ'(1) admits a closed form, −λ'(1) = π²/(6 log 2). The constant that appears in the dominant term of the variance is the so-called Hensley constant, denoted by γ_H. It involves the second derivative λ''(1), which does not seem to be related to other arithmetical constants. The Hensley constant was previously computed by the DFV-method in [13]. This paper provides a proven approximation for the Hensley constant.
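As a concrete (unproven) illustration of how the DFV-method reaches γ_G, the sketch below builds the truncated matrix M_n of the density transformer G_1[f](x) = Σ_{m≥1} (m+x)^{−2} f(1/(m+x)) in the basis (x − x0)^j and reads off its two leading eigenvalues. The centre x0 = 1/2 (the choice recalled later as the one used in [10]), the truncation order, and the Hurwitz-zeta expression for the entries are assumptions of this sketch, not quoted from the paper; the expected outputs (1 and γ_G ≈ −0.3036630029) are the values given in the text.

```python
# Sketch of DFV steps (i)-(ii) for the Gauss density transformer G_1.
# Entries use the elementary identity
#   G_1[(x - x0)^j](x) = sum_{l<=j} C(j,l) (-x0)^(j-l) * zeta(l+2, 1+x),
# where zeta(s, a) is the Hurwitz zeta function.
from math import comb
import numpy as np
from mpmath import mp, mpf, zeta

mp.dps = 60  # extra working precision while assembling the entries

def truncated_matrix(N, x0=mpf(1) / 2):
    """M[i][j] = i-th Taylor coefficient at x0 of G_1[(x - x0)^j]."""
    M = np.zeros((N + 1, N + 1))
    for j in range(N + 1):
        for i in range(N + 1):
            entry = mpf(0)
            for l in range(j + 1):
                entry += (comb(j, l) * (-x0) ** (j - l)
                          * (-1) ** i * comb(l + i + 1, i)
                          * zeta(l + i + 2, 1 + x0))
            M[i, j] = float(entry)
    return M

eigs = sorted(np.linalg.eigvals(truncated_matrix(30)).real, key=abs, reverse=True)
print(eigs[0])  # dominant eigenvalue of G_1: should be close to 1
print(eigs[1])  # subdominant eigenvalue: close to the GKW constant -0.30366...
```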
1.6 Hausdorff dimension and constrained CF-expansions. Consider some integer n and a subset A of N^n. Denote by R_A the Cantor set of reals in I whose continued fraction expansion is restricted to A. As soon as A is different from N^n, the Cantor set R_A has zero Lebesgue measure, and the Hausdorff dimension provides a precise description of it. In particular, the probability that a rational with numerator and denominator less than N belongs to R_A is O(N^{2 s_A − 2}), so that the expected time to obtain a rational A-constrained with numerator and denominator less than N is O(N^{2 − 2 s_A}). When A is finite, the reals of R_A are interesting since they are all badly approximable by rationals [31]. If, furthermore, the set A contains more than one element, the Hausdorff dimension of R_A, denoted by s_A, is a real number of ]0,1[, and is proven to be the unique real s ∈ ]0,1[ for which the dominant eigenvalue function λ_A(s) of the transfer operator G_{s,A} equals 1. In 1941, Good [15] obtained the first estimates, showing in particular that s_{1,2} < 0.5433; in 1982, Bumby [5][6] improved these estimates and obtained s_{1,2} = 0.5313 ± 10^{−4}. In 1996, Hensley [19] provided a polynomial-time algorithm in the case of a finite set A and obtained the following estimation

Finally, in 1999, Jenkinson and Pollicott [22] designed a powerful algorithm which computes s_{1,2} up to 25 digits. Note that it is not a polynomial-time algorithm. The DFV-method has been applied (heuristically) to the case of a general set A and seemed to be efficient [32]. We propose a proven polynomial-time algorithm based on the DFV-heuristics for any subset A_1 × A_2 × . . . × A_n of Q^n. In the particular case when A ⊆ Q, it gives rise to proven numerical values of s_A, and it seems to run faster than Hensley's algorithm.

2 Polynomial-time algorithms for Strictly Contracting Dynamical Systems.

As explained in Section 1, we wish to prove the DFV-method. The following definition is natural in this context.

Definition 1. [Operator with good truncations.] Let D be a disk of center x_0, and consider an operator G that acts on A∞(D). Consider the projection π_n defined in (1.1) and the truncated operator G_n := π_n ∘ G. The operator G has good truncations if the following is true: there is θ < 1 such that, for any simple isolated eigenvalue λ of the operator G, there exist a constant K > 0, an integer n_0, and a sequence λ_n ∈ Sp G_n, for which, for any n ≥ n_0, one has:

|λ_n − λ| ≤ K θ^n.

The constant θ is called the truncature ratio. If moreover the triple [θ, K, n_0] is computable, then the truncations are said to be computable.

We are interested in constants that arise in spectral objects relative to complete Dynamical Systems. A complete Dynamical System is a pair (I, T) formed with an interval I and a map T : I → I which is piecewise surjective and of class C². We denote by Q the set of the inverse branches of T; then, Q^k is the set of the inverse branches of T^k. It is known that contraction properties of the inverse branches are essential to obtain "good" properties of the Dynamical System. Usually, what is needed is the existence of a disk D which is strictly mapped inside itself by all the inverse branches h ∈ Q of the system [i.e., h(D) ⊂ D]. Here, we have to strengthen this hypothesis. This motivates the following definition.

Definition 2. [Strongly contracting dynamical system.] A complete Dynamical System of the interval I is said to be strongly contracting (SCDS in short) when the set Q of the inverse branches fulfills the supplementary condition: for any subset A ⊆ Q, there exist x_0 ∈ I and two open disks of the same center x_0, the large disk D_L and the small disk D_S, with D_S ⊂ D_L and I ⊂ D_L, such that any h ∈ A is an element of A∞(D_L) which strictly maps D_L inside D_S [i.e., h(D_L) ⊂ D_S]. Remark that (x_0, R_S, R_L) depend on A. The largest possible ratio R_S/R_L between the radii R_S and R_L of the two disks is called the A-contraction ratio.

A strongly contracting system is said to be extra-contracting (XSCDS in short) if, for any A ⊆ Q, there exist an integer k ≥ 1 and another disk D_XL (the extra-large disk) concentric with D_S, with D_L ⊂ D_XL, such that any h ∈ A^k is an element of A∞(D_XL) which strictly maps D_XL inside D_S [i.e., h(D_XL) ⊂ D_S]. We shall prove in the following that many Dynamical Systems relative to Euclidean algorithms belong to the SCDS-setting, and even to the XSCDS-setting.

Definition 3. [Transfer operator.] Let (I, T) be of SCDS-type. Consider an integer n ≥ 1, a subset A ⊆ Q^n, a real σ_A > 0, a sequence (α_h)_{h∈A} of functions
of
DL positive on > GA, the quantity
, such that, for any s with r < rf(A,SpG\{A}). The constants a c (G) and bc (G). defined by
Then, the relation
defines, for 5Rs > 0^4, an operator GS.A '• Ax>(As) —* AOQ(DL) whose norm \\GSiA\\Ds,DL is &t most 6(s,A). Such an operator is called a transfer operator with constraint A. If moreover the system (/, T) is of XSCUS-type, with an integer k > 1, the operator G£ A maps A00(Ds) into
play a central role in the paper. They first intervene in the following result, which is fundamental here. Lemma 1. Let G and G be two operators on the Banach space (B, \\ \\). Suppose that A is a simple and isolated eigenvalue of G, with an eigen/unction 0, and consider an isolating circle C = C(\, r). Then, if G and G satisfy ||G - G|| < j3c(G), then the operator G has a unique simple isolated eigenvalue A in C that satisfies
Our first result is as follows.

Theorem 1. [A transfer operator has good truncations.] In the SCDS-setting, a transfer operator G_{s,A} : A∞(D_S) → A∞(D_L) satisfies the following:
(i) It is compact, and its spectrum is formed with isolated eigenvalues of finite multiplicity, except perhaps at 0.
(ii) For any real s with s > σ_A, the operator G_{s,A} has a unique dominant eigenvalue λ_A(s), simple, positive and isolated from the other eigenvalues by a spectral gap.
(iii) For any real s with s > σ_A, the operator G_{s,A} has good truncations. The truncature ratio θ satisfies θ ≤ R_S/R_L, where R_S and R_L are the radii of the optimal pair of disks (D_S, D_L) relative to A.
(iv) In the XSCDS-setting, the truncature ratio satisfies θ ≤ R_S/R_XL, where R_S and R_XL are the radii of the optimal pair of disks (D_S, D_XL) relative to A.

Now, we prove Theorem 1 and provide explicit constants for K, n_0 and θ. The first two assertions are easily adapted from the works of Mayer [30], and we then focus on the proof of the third assertion, which is mainly based on two results. We first use a well-known result of functional analysis, which says: "when two operators are close with respect to the norm, their spectra are close too". The second result shows that the strong contraction property entails that the truncated operators converge in norm to the transfer operator.
The condition ||G - G|| < /3C(G) implies, via the definition of 0c(G), three conditions, with a precise goal for each of them. The first condition ensures that the circle C is also included in the resolvant set of G. The second condition implies that A is the unique eigenvalue of G in C. Finally, with the last condition, it is possible to relate the two spectral spaces related to A and A. This proposition will be also useful for computing the Hensley constant (in Section 4.2). 2.2 Convergence of truncated operators in the SCT>S—setting. Consider any operator G : Aoo(Ds) —* AOO(DL)- We recall that the non-zero eigenvalues of the operator nn o G and the matrix M n are the same. Then, according to the previous Lemma, it is sufficient to obtain the convergence of vrn o G to G (in norm). This is the aim of the following lemma, which requires the strong contracting property. A proof restricted to the framework of Continued Fractions can also be found in [20] with slightly different functional spaces. Lemma 2. Consider an operator G : A00(Ds) —•* AOO(DL) with norm \\G\\£>S,DL- Then, one has:
2.1 Functional analysis. Denote by (B, || ||) a Suppose furthermore that there exists some disk DXL> complex Banach space and by G an operator which acts with DL C DXL, and some k—th iterate of G which on B. Denote by SpG the spectrum of G. Consider maps AOQ(Ds) into AOO(DXL)- Then, for any eigena fixed eigenvalue A and a circle C = C(X,r) (with function 4> ofG, center A and radius r > 0) that isolates A from the remainder of the spectrum. This means that 7- satisfies
1.3), the centered algorithm C and the odd algorithm O. All these algorithms give rise to a Dynamical SysProof. For / € A00(Ds)-l the i-th coefficient o^ of tem (/,T), where T is always of the form Taylor expansion of g :— G[f] at XQ satisfies, with the Cauchy formula, the inequality a.iRlL < \\g\\oL, and the Strongly Contracting Property entails
Then, the definition of HGHDs.Di, provides the first result. For the second, note that any eigenfunction <£ belongs to AOQ(DXL), and apply relation (2.4) with RXL instead of RL 2.3 The pair [0,K,no]. We now return to the transfer operator GS)^ and we consider a simple isolated eigenvalue A of GSj^ together with an isolating circle C = C(A,r). We recall that \\G8tji\\Ds,DL is at most 6(s,A). Consider first the «SCX>«S-setting. Denote by no the smallest integer n for which
where V_S(u) is the integer part of u, V_C(u) is the nearest integer to u, and V_O(u) is the odd integer nearest to u. The intervals I_S and I_O are equal to [0,1], while the interval I_C equals [0,1/2]. The set of the inverse branches is Q_S = Q, already described in (1.3), and, in the two other cases,
(here, the order > is relative to the lexicographic order.) In each of the three cases, it is proved in [34] that there exists a disk D, with I C D, which is strictly mapped inside itself by all the inverse branches h e Q of the system [i.e., h(D) C D]. In fact, in each case, these systems are of SCT>S type and the pair (XQ,RS,RL) can be chosen as follows:
According to the two previous lemmas, for all n > no, the truncated operator 7rn o GS>A (and then the matrix MS).4in) admits a unique eigenvalue An in the interior of C that satisfies |An - A| < K6n where K and 9 are given by
For each of the three algorithms, Theorem 1 proves that the "spectral constants" are polynomial-time computable. Then, all the eigenvalues A(s) can be computed in polynomial-time. The case s = 1 is trivial since A(l) = 1, but, the case s = 2 is of great interest, since A(2) plays an important role in lattice reduction algorithms [10] and comparison algorithms using the continued fraction expansion [35]. Finding an estimate for A(2) actually motivated the DFV method. The entropy -A'(l) is explicit for the three algorithms, but the asIn the case of the XSCT>S-sett'mg, the integer no is the sociated Hensley's constant is proven to be polynomialsame as previously; however, the constants K and 9 can time computable (with Theorem 1 and methods of 4.2). be chosen as All the previously described constants are related to the spectrum of the classical transfer operator, i.e.,
Moreover, the actual analysis of Euclidean Algorithms deals with various "costs". The cost of an execution This ends the proof of Theorem 1. is the sum of the cost c relative to each step of the algorithm and involves a cost c which depends only on 2.4 Instances of the SCT>S setting and applica- the digit produced at each step. Then the analysis of tions of Theorem 1. The work [34] introduces a class this cost introduces a more general transfer operator, of Euclidean algorithms that is called the Fast Class. We mainly consider here three algorithms of this class : the standard algorithm S (already described in Section
More precisely, (see [3]), if the cost c is of "moderate growth" , the variance of the total cost of an execution can be expressed with the derivatives (of order 1 or 2) of the dominant eigenvalue Xs<w of Gs,w If the cost is of "large growth", then the Hausdorff dimension relative to this cost also involves the dominant eigenvalue A S]U , ([7]). And the DFV-method can be proven to apply, with exponential rate of convergence. The list of these possible applications of Theorem 1 is not exhaustive. We now focus on the standard Euclidean dynamical system, where it is possible to provide estimates for the pair [K, HQ}.
Theorem 2. The (standard) Euclidean Dynamical System is a system of XSCDS-type. For any A ⊆ Q, the triple [K, n_0, θ] used for approximating the dominant eigenvalue λ_A(s) can be estimated, and there exists an effective algorithm that computes λ_A(s) in polynomial time. The remainder of the Section is devoted to proving Theorem 2.
3.1 Disks Z>5, £>£,, DXL an -mA}. We then can choose for the disk DXL needs an estimate of A and a lower bound for the any disk of center XQ and radius spectral gap around A. We prove in Lemmas 3 and S 4 that it is possible, at least when A = \A( } is the dominant eigenvalue, If we use the XSCUS-seiimg, we wish to obtain an estimate of the eigenfunction 0 relative to A. On the other hand, even when C — C(\,r) is welldefined, a lower bound for the constant /?c(G), given in (2.2) is not easy to compute. It involves, via the definition of ac(G) in (2.1) an upper bound for the norm of an inverse operator. Such an upper bound is in general hard to compute. However, when G is normal, an explicit an expression for ac(G) are known,
(we recall that G is normal when it commutes with its dual G*) But the normality is a rare phenomenon, which is difficult to prove. Here, it is not true that the transfer operator Gs,.4 is normal on spaces A00(D); however, there exists another functional space (a Hardy space, denoted by H.S,A, which depends on (s,A) where GS).4 is normal (note that this normality phenomenon does not seem to hold for the two other Dynamical Systems C, O] . Even if the two spaces, the Hardy space and the space A00(D) are different, the associated norms can be compared, and this provides an upper bound for <XC(GS,A), and a lower bound for j3c(G}. Finally, the results of this Section lead to the second main result of the paper.
3.2 Estimate for the dominant eigenvalue of GS)^. We use the following classical result previously used in [10]: Let G be an operator which acts on the space of analytic functions on an interval [a, b]. Furthermore, the operator G is positive (i.e., G[f] > 0 if f > 0) and has a unique dominant eigenvalue A isolated from the remainder of the spectrum by a spectral gap. Suppose that there exist two strictly positive constants c\ and c% and a function f which is strictly positive and analytic on [a, 6] and satisfies c\ f(x] < Gs>A[f](x) < C2/(x), for any x € [a, b}. Then, the dominant eigenvalue A satisfies c\ < A < c2Denote by mA the minimum of A and by MA its supremum (possibly infinite if A is infinite). By convention, if MA = oo, we put h,MA = 0. Each LFT HMA ° hm^ or hmA ° JIMA is called an extremal LFT; it has a unique positive fixed point, denoted by a A or bA. Then the disk DA with diameter [a^,6^] is the smallest disk which is mapped into itself by all the elements of A. The application of the previous result with the operator GS^A and the functions / = 1 for the upper bound
and / = 1/(1 + 0x)2s for the upper bound provides the But we can show by simple calculations that, estimate for the dominant eigenvalue \A(s], as a function of the Hurwitz zeta function (,A restricted to A, with ^ as in the lemma 3. Now, using the relations (3.4, 3.5) yields the following result: Lemma 4. The spectral gap satisfies Lemma 3. Fix the real 0 = (-mA + \/rn2A + 4)/2. The dominant eigenvalue \A(s) admits the following estimates, which involve the Hurwitz zeta function restricted to A [defined in (3.2)] Remark that the previous estimates can be improved using the improved estimates of A^(s). Since the the function x —»• (1 + /?x)2sO(2s,^ + x) is Wirsing [36] has shown that -}G satisfies 0.3020 < |7cl < 3043. Since the dominant eigenvalue of GI is 1, using increasing, the previous estimates can be improved to the trace one gets |7c| — |^| > 0.18959 where p. is one of the sub-subdominant eigenvalues of GI . This improves the previous estimate for the spectral gap around 70 where aA is the fixed point described in the previously. which was (7^! — \p>\ > 0.031. 3.3 Estimate for the spectral gap. In this second step, we determine a lower bound the for spectral gap PA(S) between the eigenvalue XA(s) and the remainder of the spectrum. For this purpose, we use the trace of transfer operators. Grothendieck introduced the so-called nuclear operators (of order 0) and proves that they possess a trace that can be viewed as a generalisation of the usual (matrix) trace. Our transfer operator is nuclear (of order 0) and its trace equals the sum of all the eigenvalues. In particular, Tr G2 is just the sum of all the squares of the eigenvalues of G. This entails a relation between Tr G2, the dominant eigenvalue XA(s) and one of its subdominant eigenvalue
3.4 Normality on Hardy spaces. As already said, the constant ac(GS)^) has a closed form (3.1) as soon as GS,A is normal. The transfer operator is not normal on AOO(.DS) but it is normal on another space called a Hardy space [21] and denoted 7is,* A- For p 6 R, denote by Pp the half-plane Pp = [z e C ; > p}. The Hardy space T~LS,A is formed with the functions / which are analytic on P_ m ^/ 2 and bounded on all the halfplanes Pp (with p > -ra.4/2) and admit the following integral representation:
with (3.4) A(s]
< TrG^-A^s)
The operator G2 A is the sum of operators of the form With the associated norm L[/] = \ti\s • f oh for h £ A2. When h is indexed by the pair (i, j ) , the spectrum of L is exactly a geometric progression of the form {T~2s~2n : n>0} with the space H.S,A is a Banach space. There exist close relations between 7iSiA and A00(Ds)For A = G, Babenko [2] and Mayer [28] proved that the Finally, thanks to the additivity of the trace, the trace behaviour of Gs is comparable on HS,A on A00(D^). 2 of G A satisfies Their methods cannot be easily generalized in the case when A 7^ Q. Then, in this case, we provide here a different method which makes a great use of the generalized Laguerre polynomials.
Lemma 5.
For any complex s with 3?(s) >
max((Tv4,0),
(i) the transfer operator GSiA : 7ts,A -* ?is,A is isomorphic to an integral operator; it is normal and selfadjoint for real values of s. Thus, for real values of s, the spectrum ofGsAis real. («t) the spectrum of G.^ on HS,A and the spectrum of GS,A on AOO^DS) are the same. (Hi) Let D be an intermediary disk of center XQ and radius R with RS < R < RL and f a function of A^D). For any subset A ofG, the function GS)^[/] belongs to Ti.s,A. (iv) Define, for any R (with Rs < R < RL), three constants KI, «2, «3 (which depend on x0, s, A, R),
Then, \he following is true:
Before proving Lemma 5, we explain how it provides an estimate for ac(GSi^). Consider the isolating circle C of center \A(s) and radius rA(s) described in Lemma4, and consider a point z € C. The two inclusions
. Finally, with the formulae (3.11) and (3.12), the inequality \\(G holds. Now, the estimate of XA(s) (given in Lemma 3)
yields that any zE Z satisfies \
and an upper bound of otc(Gs,A) follows. Finally, we obtain: Lemma 6. Denote by rA(s) the lower bound 4 For any intermediary radius R, with Rs <
of Lemma R < RL and s > max(0, &A), there exist constants Ki denned in Lemma 5 (which depend on x0l R, s, A), for which
Proof of Lemma 5. For (i) and (ii), we refer to the work of Jenkinson, Gonzalez and Urbanski [21], and we mainl y deal with (*«) and (iv^ The stronS contraction condition implies the inequality (3.8). The inequality (3.9) is a direct application of the CauchySchwartz inequality with the identity
The proof of inequality (3.10) is more involved and we only explain here its main steps. First Hensley [20] has shown that, for all j > 0, the function GStA[(X - x0)J] is an element of liSjA whose integral representation is closely related to generalised Laguerre polynomials LJ s~ '. The Laguerre polynomials ( L j ) form an orthogonal basis for the weight t p e~ f on ]0, oo[ and they verify the formula
together with the relation
entail the following
The function GS,A[(X — XQ)J] then satisfy
We deduce that the norm is given by Now, GS,.A is normal on Ti.s,A, so that
But the Laguerre polynomials are positive and decreas- It is possible to estimate a circle C that isolates 70 ing on [0, 2s/(j 4- 1)]. Using these properties with some together with the associated constant ac(Gi). Thus, relations of orthogonality and splitting the integrand the DFV-method provides proven numerical values for 7G- Only one computation of matrix is needed, and the we prove the inequality complexity of 70 is of order four. Numerical results are summarized up in Figure 4.1.
Now, the i-th coefficient c» of the Taylor expansion of at x0 satisfies R>\CJ\ < ||/||D and finally
4.2 Algorithm for the Hensley constant. The Hensley constant (see 1.8) involves the first two derivatives of A(s) at s = 1. Since the first derivative A'(l) has a closed form A'(l) = — 7r 2 /(6 log 2), it remains to compute the second derivative A"(l). Consider an interval Ih of the form 7^ := [1 — /i, 1 + h] and suppose that an estimate A of A satisfies
Note that the previous series converges exponentially fast. Finally, the constants K and no defined in (2.5) and(2.6) Then Taylor's formulae entail the estimate involve the isolating circle C whose center A^(s) and radius r^(s) together with ac(Gs
4 Application of the DFV-Method to three constants. This section applies the previous results and provides (in polynomial time, via the DFV-Method) proven numerical values for three continued fraction constants: the Gauss-Kuz'min-Wirsing constant, the Hensley constant and the Hausdorff dimension of the Cantor sets R_A with A ⊆ Q.
It then suffices to know an upper bound for the fourth derivative A^ on the interval Ih- The application s —> Qs is analytic and the derivative G^ satisfies l|G'JDs < 8 f o r s > 0 . 9 . Then, ||Gs-G1||Ds <8|s-l|. We apply Lemma 3: the circle C of center 1 and radius TI := (1 — 7c)/2 is an isolating circle for A = 1. Then, if s satisfies |s — 1| < r-2 with r% — /?c(G)/8, the operator Gs admits a unique simple isolated eigenvalue A(s) in C which satisfies |A(s) — 1| < r with r := 16 TI r2 ctc(G) In the second step of the DFV-method, we deal with as soon as \s — 1| < r-2- Since the application s —>• G s computations on the matrix ~M.s,A,n- First, we have to is analytic, the function s —» A(s) is analytic too. The build the matrix; second we have to find the roots of Cauchy formula, applied in the disk of center 1 and det(M Si ^ in - zln). Then we conclude: radius TI yields the upper bound For any subset A C Q, building matrix MSj-4jTl needs O(n3) multiplications and additions on reals and 2n+1 computations of £A(S) functions; Computing SpMS)V4)Tl needs at most (9(n4) arithmetical operations (with a bad method). Then, 7/7 is computable as soon as the two estimates for A(l + h) and A(l — h) are known (asymptotically to In this section, we shall prove the last result of the within twice the required precision). This thus needs paper: two computations of the step 2 of the DFV-method. Theorem 3. For the three following constants the Gauss-Kuz'min-Wirsing constants, the Henley con- Tabular 4.2 summarizes some numerical results. stants, the Hausdorff dimension relative to constraints Algorithm for the Hausdorff dimension. A C Q- it is possible to provide d proven digits in 4.3 The algorithm uses a classical dichotomy principle and polynomial-time in d. computes a sequence of intervals of length 2~ fc which 4.1 Algorithm for the Gauss-Kuz'min-Wirsing contain the Hausdorff dimension 5^4. Consider the constant. The Gauss-Kuz'min-Wirsing constant 7^ is interval [wfc_i, Vk-i] obtained after (k — 1) steps (it is the unique subdominant eigenvalue of GI. It is real. of length 2~( fc ~ 1 )and contains SA- Denote by Wk the
digits   time   proven value
10       11s    -0.3036630028
20       1m46   -0.30366300289873265859
30       9m54   -0.303663002898732658597448121901
40       34m    -0.3036630028987326585974481219015562331108
50       1h41   -0.30366300289873265859744812190155623311087735225365
Figure 2: Gauss-Kuz'min-Wirsing constant
the middle point of the interval [u_{k−1}, v_{k−1}], and compute an estimate of λ_A(w_k). Now, there are three possible cases:

digits   time     proven value
5        2m30s    0.51606
10       7m30s    0.5160624089
15       41mn     0.516062408899991
20       2h33mn   0.51606240889999180681
Figure 3: Hensley constant
This algorithm is just a classical binary splitting. The proof that s_A belongs to [u_k, v_k] is based on the strict decrease of λ_A together with the inequality |λ_A(s) − λ_A(s + h)| > h. There are at most O(d) iterations, each of them of cubic complexity (in d). Thus, the complexity of the algorithm is asymptotically O(d^5). Numerical results are given in Figure 4.3 for the Cantor set R_{1,2}.
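To illustrate the dichotomy just described, the sketch below approximates s_{1,2} by bisection on s, using a truncated matrix for the constrained operator G_{s,A}[f](x) = Σ_{m∈A} (m+x)^{−2s} f(1/(m+x)) with A = {1, 2}. It is only a numerical illustration of the principle, without the proven error bounds of the paper; the expansion centre x0, the truncation order, and the bisection bracket are ad-hoc choices, and the target value 0.5312805... is the one reported in Figure 4.

```python
# Bisection for the Hausdorff dimension s_A of the continued-fraction Cantor set
# with digits restricted to A = {1, 2}: s_A is the root of lambda_A(s) = 1, where
# lambda_A(s) is the dominant eigenvalue of the constrained transfer operator.
# Matrix entries come from the Taylor expansion of (m + x)^(-(2s+l)) at x0.
from math import comb
import numpy as np
from mpmath import mp, mpf, rf, factorial

mp.dps = 40
A = (1, 2)            # allowed continued-fraction digits (illustrative choice)
X0 = mpf("0.55")      # expansion centre, roughly the middle of the Cantor set
N = 24                # truncation order (illustrative)

def dominant_eigenvalue(s):
    """Dominant eigenvalue of the truncated matrix of G_{s,A}."""
    M = np.zeros((N + 1, N + 1))
    for j in range(N + 1):
        for i in range(N + 1):
            entry = mpf(0)
            for l in range(j + 1):
                c = (comb(j, l) * (-X0) ** (j - l)
                     * (-1) ** i * rf(2 * s + l, i) / factorial(i))
                entry += c * sum((m + X0) ** (-(2 * s + l + i)) for m in A)
            M[i, j] = float(entry)
    return max(np.linalg.eigvals(M).real)

lo, hi = mpf("0.4"), mpf("0.7")
for _ in range(40):                      # lambda_A(s) is strictly decreasing in s
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if dominant_eigenvalue(mid) > 1 else (lo, mid)
print((lo + hi) / 2)   # should be close to 0.5312805...
```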
Conclusion
We proved here that the DFV-method gives rise to an algorithm that computes any isolated simple eigenvalue of a transfer operator, in polynomial time, provided that two conditions are fulfilled: (i) the operator has good truncations and (ii) the matrices M_n are easy to compute. However, if we are interested in actual proven numerical values, we need to evaluate the parameters that intervene in the design of the algorithm. These parameters are in general difficult to compute, but we solve this difficulty for the transfer operators relative to the Standard Euclidean Algorithm. The DFV-method can also be used to compute the Hausdorff dimension of the Cantor sets R_A with A of the form A = A_1 × A_2 × . . . × A_n, A_i ⊆ N. However, the computation of the matrix is more involved, since its coefficients deal with more complicated zeta functions. Finally, the authors of [10] and Sebah used the DFV-method with x_0 = 1/2. This particular choice does not enter our framework, since no disk of center 1/2 is strictly mapped into itself. We can use any disk D_L of center 1/2 + δ with radius 1/2 + 2δ, i.e. diameter [−δ, 1 + 3δ], with δ > 0. This leads to a truncature ratio R_S/R_XL which tends to 1/3 − ε as δ tends to zero, which is the convergence rate actually observed by the authors. However, our method of Section 3.4 does not seem to apply there, and we do not know how to obtain an estimate for the pair [K, n_0] in this setting.

References
[1] M. AHUES, A. LARGILLIER, V. LIMAYE. Spectral computations for bounded operators, Chapman & Hall/CRC (2001). [2] K. I. BABENKO. On a problem of Gauss, Soviet Math. Dokl. 19 (1978), 136-140. [3] V. BALADI AND B. VALLEE. Euclidean algorithms are Gaussian, Les Cahiers Du GREYC, Universite de Caen (2003) [4] K. BRIGGS. A Precise Computation of the GaussKuzmin-Wirsing Constant. Preliminary report. 2003 July 8. http://research.btexact.com/teralab/documents/wirsing.pdf. [5] R. T. BUMBY. Hausdorff dimension of Cantor sets, J. Reine Angew. Math. 331 (1982), 192-206 [6] R. T. BUMBY. Hausdorff dimension of sets arising in number theory, Number Theory (New-York, 19831984), Lecture Notes in Math., 1135, Springer, 1985, pp. 1-8 [7] E. CESARATTO. Thesis, University of Buenos Aires, 2003 [8] T. CusiK. Continuants with bounded digits, Matematika24 (1977), 166-172 [9] T. CusiK. Continuants with bounded digits II, Matematika25 (1978), 107-109 [10] H. DAUDE, P. FLAJOLET, B. VALLEE. An AverageCase Analysis of the Gaussian Algorithm for Lattice Reduction, Combinatorics, Probability and Computing (1997) 6, pp 1-34 [11] J. G. DIXON. The number of steps in the Euclidean algorithm, J. Number Theory, 2 (1970), 414-422
digits   time    proven value of s_{1,2}
5        2m      0.53128
10       8m      0.5312805062
15       25m     0.531280506277205
20       1h      0.53128050627720514162
30       4h26    0.531280506277205141624468647368
40       14h11   0.5312805062772051416244686473684717854930
45       23h10   0.531280506277205141624468647368471785493059109
Figure 4: Hausdorff dimension of R_{1,2}
[12] S. FINCH. Mathematical Constants. Cambridge University Press (2003) [13] PH. FLAJOLET AND B. VALLEE. Continued Fractions, Comparison Algorithms, and Fine Structure Constants, in Constructive, Experimental et Non-Linear Analysis, Michel Thera, Editor, Proceedings of Canadian Mathematical Society, Vol 27 (2000), pages 53-82 [14] C. F. GAUSS. Recherches Arithmetiqu.es, 1807, printed by Blanchard, Paris, 1953 [15] I. J. GOOD. The fractional dimension of continued fractions, Proc. Camb. Phil. Soc. 37 (1941), 199-228. [16] A. GROTHENDIECK. Produits tensoriels topologiques et espaces nuclaires, Mem. Am. Math. Soc. 16 (1955) [17] H. HEILBRONN. On the average length of a class of continued fractions, Number Theory and Analysis, P. Turan, ed., Plenum, New York, 1969, pp. 87-96 [18] D. HENSLEY. The number of steps in the Euclidean algorithm, J. Number Theory, 49(2) 142-182 [19] D. HENSLEY. A polynomial time algorithm for the Hausdorff dimension of a continued fraction Cantor set, J. Number Theory, 58(1)(1996), 9-45 [20] D. HENSLEY. Continued Fractions, World Scientific, book to appear [21] O. JENKINSON, L.F. GONZALEZ, M. URBANSKI. On transfer operators for continued fractions with restricted digits Proc. London Math. Soc., to appear. [22] O. JENKINSON AND M. POLLICOTT. Computing the dimension of dynamically denned sets I: E% and bounded continued fractions, preprint, Institut de Mathematiques de Luminy, 1999. [23] M. KRASNOSELSKII. Positive solutions of operator equations, P. Noordhoff, Groningen, 1964. [24] R.O. KUZ'MIN. On a problem of Gauss, Atti del Congresso internazionale dei matematici, Bologna, 1928, Vol. 6, 83-89 [25] D. LAME. Note sur la limite du nombre de divisions dans la recherche du plus grand commun diviseur entre deux nombres entiers, C. R. Acad. Sc. 19 (1845) 867870 [26] P. LEVY. Sur la loi de probabilite dont dependent les quotients complets et incomplets d'une fraction continue, Bull. Soc. Math. France 57 (1929) 178-194 [27] D. H. MAYER. On composition operators on Banach spaces of holomorphic functions, J. Funct. Anal. 35
(1980), 191-206 [28] D. H. MAYER. Continued fractions and related transformations, Ergodic Theory, Symbolic Dynamics and Hyperbolic Spaces, T. Bedford, M. Keane and C. Series (eds), Oxford University Press, 1991, 175-222. • [29] D. H. MAYER. On the thermodynamic formalism for the Gauss map, ibid. 130 (1990), 311-333. [30] D. H. MAYER. Spectral properties of certain transfer composition operators arising in statistical mechanics, Communications in Mathematical Physics, 68(1979), 1-8 [31] J. SHALLIT. Real Numbers with Bounded Partial Quotient: A Survey in G?, (M. Rassias Ed.), The Mathematical Heritage of Carl Friedrich Gauss, World Scientific, Singapore, 1991. [32] B. VALLEE. Dynamique des fractions continues contraintes priodiques, Journal of Number Theory 72(1998), no. 2, 183-235. [33] B. VALLEE. Operateurs de Ruelle-Mayer generalises et analyse en moyenne des algorithmes d'Euclide et de Gauss, Acta Arithmetica, 141.2 (1997). [34] B. VALLEE. Dynamical Analysis of a class of Euclidean Algorithms, Theoretical Computer Science, vol 297/13 (2003) pp 447-486 [35] B. VALLEE. Algorithms for computing signs of 2 x 2 determinants: dynamics and average-case analysis, Proceedings of ESA'97, LNCS 1284, pp 486-499. [36] E. WIRSING. On the theorem of Gauss-Kusmin-Levy and a Frobenius-type theorem for function spaces, Acta Arith. 24 (1974), 507-528
Compositions and Patricia tries: no fluctuations in the variance! Helmut Prodinger Dedicated to Hosam Mahmoud on the occasion of his 50th birthday.
Abstract. We prove that the variance of the number of different letters in random words of length n, with letters i and probabilities 2^{-i} attached to them, is 1 + o(1). Likewise, the variance of the insertion cost of symmetric Patricia tries of n random data is given by 1 + o(1). These two examples disprove the popular belief that such quantities must always contain fluctuating terms.
1 Introduction

A surprisingly large number of results in the analysis of algorithms contain fluctuations. A typical result might read "The expected number of ... for large n behaves like log_2 n + constant + δ(log_2 n)." Examples include various trie parameters, approximate counting, probabilistic counting, radix exchange sort, leader election, skip lists, adaptive sampling; see the classic books by Flajolet, Knuth, Mahmoud, Sedgewick, Szpankowski [16, 11, 12, 14, 18] for background.

As one can see from Figure 1, δ(x) has mean zero (the zeroth Fourier coefficient is not there) and very small amplitude. On the other hand, δ²(x) is still periodic with period 1, but its mean is not zero. Why should we worry about a quantity apparently as small as ≈ 10^{-12}? The reason is the variance of such parameters, as it naturally contains the term "−expectation²," and as such also −δ²(x). That might not be a sufficient motivation for a casual reader if it were not the case that often substantial cancellations occur. In order to identify them, one has to know more about δ²(x). If one ignores these terms, one gets wrong results, and the results are not wrong by ≈ 10^{-12}, but by an order of growth! Path length in tries, Patricia tries, and digital search trees [4, 10, 5] are such cases: the variance is in reality of order n only, but ignoring the fluctuations
Figure 1: δ(x) and δ²(x)
would lead to a (wrong) ≈ n² result. Size and node level in unbiased (= symmetric) tries exhibit concentration of distribution, but proving this requires nontrivial modular form identities, as described for instance in [8]. This in turn has impact on the stability of certain communication protocols, in particular the tree protocol of Capetanakis-Tsybakov-Mikhailov, whose status remained partly unsettled for a few years: see the papers by Berger, Gelenbe and Massey in [13] and the special issue [15].

Now, questions like that occurred in several writings of this author (together with various coauthors), as can be seen from the references. The techniques are extremely interesting, as one has to dig deep into classical analysis. So far, it seems that the calculus of residues, as used in the sequel, is the most versatile approach in this context. Another approach is to use (modular) identities due to Dedekind, Ramanujan, Jacobi and others (which can often be proved by Mellin transform techniques); however, often they do not quite fit. The residue calculus approach directly addresses the

*Supported by NRF Grant 2053748.
†The John Knopfmacher Centre for Applicable Analysis and Number Theory, School of Mathematics, University of the Witwatersrand, Private Bag 3, Wits, 2050 Johannesburg, South Africa, [email protected]
formula that is ultimately needed. Many such considerations have been performed about 10-15 years ago, but a new surprise showed up in August 2003: there are two examples where the variance has no fluctuation at all (at least not in the leading term). This does not seem to be intuitive by any means, so we must rely on some analysis to exhibit that phenomenon. The present paper is devoted to just that. The two sections that follow prove the following theorem.

THEOREM 1.1. 1. Consider words x_1 . . . x_n, where the letters follow (independent) geometric random variables X with P{X = i} = 2^{-i} for i = 1, 2, 3, . . . . Then the variance of the parameter "number of different letters in a random word of length n" is 1 + o(1).
2. The variance of the insertion cost of a random Patricia trie constructed from n random data is 1 + o(1).

The methods that are presented here also allow one to simplify the Fourier coefficients of the fluctuations in the variance even in such cases where the periodic function persists. After all, the ultimate simplification is to show that the Fourier coefficients are zero in the two cases on which this paper concentrates.

2 Words and Compositions

In a forthcoming paper [1], words x_1 . . . x_n are considered, where the letters follow (independent) geometric random variables X with P{X = i} = p q^{i-1} for i = 1, 2, 3, . . . and p + q = 1. The parameter of interest is the number of different letters in a random word of length n which appear at least b times. Pawel Hitczenko, who visited us in August 2003, reported that he and Guy Louchard [2] had considered the special case p = q = 1/2 and b = 1 in the context of random compositions. The variance of the number of part sizes is given by 1 + o(1). Our general analysis however predicted a result of the form log_Q 2 + δ_V(log_Q n) + o(1), with Q = 1/q and a periodic function δ_V(x) of period one. Such oscillations are quite common in analysis of algorithms and enumerative combinatorics; see e.g. the books [16, 18]. We were both right, and, indeed, in the special case, the periodic function cancels out! This will be demonstrated in the sequel. After this surprise, I looked for other examples from the past with periodic oscillations in the variance that might actually not be there, and I found the instance of (symmetric) Patricia tries, which I will treat in the following section.¹

For interest, the variance is computed from the exponential generating function

The periodic function that our analysis exploits is given as a Fourier series,

with

here and in the sequel, we will use the abbreviations L = log 2 and χ_k = 2kπi/L. Not surprisingly, the sum term originates from the square of a periodic function which was contained in the asymptotic expansion of the expectation. The function g(x) is defined by

and ψ(x) is the logarithmic derivative of the Gamma function. Our goal is to show that a_k = 0 for all k ∈ Z, k ≠ 0. Let us rewrite the formula for a_k, using nothing more than the formula Γ(−x) x (x − 1) . . . (x − j + 1) = (−1)^j Γ(j − x).

¹It is not likely that significantly different examples can be found.
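Theorem 1.1(1) is easy to probe empirically. The following sketch is not part of the paper; the word lengths, trial counts and seed are arbitrary choices. It estimates the variance of the number of distinct letters in random geometric(1/2) words, and the estimates should hover near 1.

```python
# Monte Carlo estimate of Var(number of distinct letters) for words of length n
# over geometric letters with P{X = i} = 2^(-i).  Illustrative check of
# Theorem 1.1(1); the variance estimates should stay close to 1.
import random
from statistics import pvariance

def geometric_letter(rng):
    """Sample P{X = i} = 2^(-i), i >= 1: count fair-coin trials until heads."""
    i = 1
    while rng.random() < 0.5:
        i += 1
    return i

def distinct_count(n, rng):
    return len({geometric_letter(rng) for _ in range(n)})

rng = random.Random(2003)
for n in (100, 1000):
    sample = [distinct_count(n, rng) for _ in range(10_000)]
    print(n, pvariance(sample))
```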
The technique to rewrite the second sum (and similar ones) accordingly was already presented in earlier publications; let me just cite [9]: One considers
and its integral
The choice of this function is driven by the fact that the denominator has simple poles at z = Xj f°r a^ 3 € Z, and that the numerator produces the "right" residues. Figure 2: The path of integration. The line of integration will be shifted to 3?z = «. r eac There are poles at z = —Xj> f° h J € Z. Taking them into account, one gets This integral can be evaluated using the technique of Barnes, as explained in [19, p. 286ff]. The line of integration (see Figure 2) must be shifted to 5Rz = 0, with the provision that the singularities of F(—Xk + 2), i.e., z = Xk, must lie on the left of the path, and the singularities of F(—z), i. e., z = 0, must lie on the right of the path. The first thing can be achieved by subtracting the residue at z = Xk, which leads to a term T(—Xk)\ f°r the second thing, nothing must be Now one writes done. So, and gets
Altogether, we have seen that /2 = 0. We note that the integral is the sum of the (negative) residues right to The simple change of variable z := 2 + Xk produces the the line 3?z = -: integral I\ again, and one finds
On the other hand, integral What remains is the evaluation of the integral
can simply be evaluated by shifting the line of integration to the right, and collecting (negative) residues at I
for / = 1,2. . . . The result is
These results can be rewritten, using Γ(x + 1) = x Γ(x) and ψ(x + 1) = ψ(x) + 1/x:
Putting the two different evaluations together, one sees
Szpankowski obtained related results in [17]
and this is the identity we wanted to prove.
Now if we look at the coefficient of e^{2πikx}, for k ≠ 0, we find
According to the analysis in the previous section, these coefficients are all equal to zero. Let us finally consider

and its integral

3 Patricia tries

The variance of the insertion cost of a random Patricia trie constructed from n random data was computed in [7] as

with the same function g(x) as before, and
For interest, the variance is computed from the recursion
The line of integration will be shifted to ℜz = −1/2. There are poles at z = χ_k, for each k ∈ Z, and so
n > 2, H0(w) = Hi(w) = 1, via
²There is a typo; the paper [6] contains the correct version. This paper discusses M-ary Patricia tries, and only for the instance M = 2 (binary Patricia tries) do the cancellations occur.
^In the early years, we worked on such things independently, not long after that, we became coauthors and friends. For instance, we considered the path length in Patricia tries in [10]. Sure enough, cancellation phenomena showed up, but the variance still contains a periodic fluctuation.
As before,
with
This integral is the sum of the (negative) residues right to the line 3ftz = —5, i.e.,
On the other hand, I\ can be computed as the sum of the (negative) residues right to the line 5Rz = |, viz.
the two different evaluations give the identity. Putting everything together, we find that the variance of the insertion cost of a Patricia trie constructed from n random data is just 1 + o(1). Patricia tries surprised this author in 1986, when it turned out that the constant
is just 1.00000000000.?2,?7... . That was nicely explained by Johannes Schoifiengeier [3]. Now, thanks to Pawel Hitczenko, who made me think about Patricia tries (and related material) again, they offered a new surprise in 2003. Acknowledgment. I thank Margaret Archibald for the critical reading of an earlier draft. References
[1] M. Archibald, A. Knopfmacher, and H. Prodinger. The number of distinct values in a geometrically distributed sample. In preparation, 2003. [2] P. Hitczenko and G. Louchard. Distinctness of compositions of an integer: A probabilistic analysis. Random Structures & Algorithms, 19:407-437, 2001.
[3] P. Kirschenhofer, H. Prodinger, and J. SchoiBengeier. Zur Auswertung gewisser numerischer Reihen mit Hilfe modularer Funktionen. In E. Hlawka, editor, Zahlentheoretische Analysis II, volume 1262 of Lecture Notes in Mathematics, pages 108-110, 1987. [4] P.. Kirschenhofer, H. Prodinger, and W. Szpankowski. On the variance of the external path length in a symmetric digital trie. Discrete Applied Mathematics, 25:129-143, 1989. [5] P. Kirschenhofer, H. Prodinger, and W. Szpankowski. Digital search trees again revisited: The internal path length perspective. SIAM Journal on Computing, 23:598-616, 1994. [6] P. Kirschenhofer and H. Prodinger. Asymptotische Untersuchungen iiber charakteristische Parameter von Suchbaumen. In E. Hlawka, editor, Zahlentheoretische Analysis H, volume 1262 of Lecture Notes in Mathematics, pages 93-107, 1987. [7] P. Kirschenhofer and H. Prodinger. Further results on digital search trees. Theoret. Comput. Sci., 58:143154, 1988. [8] P. Kirschenhofer and H. Prodinger. On some applications of formulae of Ramanujan in the analysis of algorithms. Mathematika, 38:14-33, 1991. [9] P. Kirschenhofer and H. Prodinger. A result in order statistics related to probabilistic counting. Computing, 51:15-27, 1993. [10] P. Kirschenhofer, H. Prodinger, and W. Szpankowski. On the balance property of Patricia tries: external path length viewpoint. Theoret. Comput. Sci., 68:117, 1989. [11] D. E. Knuth. The Art of Computer Programming, volume 1: Fundamental Algorithms. Addison-Wesley, 1973. Third edition, 1997. [12] D. E. Knuth. The Art of Computer Programming, volume 3: Sorting and Searching. Addison-Wesley, 1973. Second edition, 1998. [13] G. Longo, editor. Multi-User Communication Systems, volume 265 of CISM Courses and Lecture Notes. Springer Verlag, 1981. [14] H. M. Mahmoud. Evolution of Random Search Trees. John Wiley, New York, 1992. [15] J. Massey, editor. Random Access Communication. I.E.E.E Press, 1985. [16] R. Sedgewick and P. Flajolet. An Introduction to the Analysis of Algorithms. Addison-Wesley, 1996. [17] W. Szpankowski. Patricia tries again revisited. J. Assoc. Comput. Mach., 37:691-711, 1990. [18] W. Szpankowski. Average case analysis of algorithms on sequences. Wiley-Interscience, New York, 2001. [19] E. T. Whittaker and G. N. Watson. A Course of Modern Analysis. Cambridge University Press, fourth edition, 1927. Reprinted 1973.
215
Quadratic Convergence for Scaling of Matrices^ Martin Piker* Abstract Matrix scaling is an operation on nonnegative matrices with nonzero permanent. It multiplies the rows and columns of a matrix with positive factors such that the resulting matrix is (approximately) doubly stochastic. Scaling is useful at a preprocessing stage to make certain numerical computations more stable. Linial, Samorodnitsky and Wigderson have developed a strongly polynomial time algorithm for scaling. Furthermore, these authors have proposed to use this algorithm to approximate permanents in deterministic polynomial time. They have noticed an intriguing possibility to attack the notorious parallel matching problem. If scaling could be done efficiently in parallel, then it would approximate the permanent sufficiently well to solve the bipartite matching problem. As a first step towards this goal, we propose a scaling algorithm that is conjectured to run much faster than any previous scaling algorithm. It is shown that this algorithm converges quadratically for strictly scalable matrices. We interpret this as a hint that the algorithm might always be fast. All previously known approaches to matrix scaling can result in linear convergence at best. 1 Introduction The permanent of an n x n matrix A with entries a^(i,j = 1 , . . . , n) is defined by
matrix scaling and the Theorem of D.I. Falikman [5] and G.P. Egorichev [4] (previously known as the van der Waerden conjecture) to show that a deterministic polynomial time algorithm can approximate permanents of 0-1 matrices up to an exponential factor. Given the fact that such matrices have permanents anywhere in the range of integers from 0 to n!, this is already a good approximation. Matrix scaling has long been investigated [7, 17, 12, 6, 15, 11, 10, 2, 13, 14]. In its basic form, matrix scaling tries to transform a nonnegative quadratic matrix into a doubly stochastic matrix by multiplying rows and columns with positive reals. In its general form, the number of rows can be different from the number of columns, and the desired row and column sums are not necessarily all 1, but arbitrarily given. For simplicity, we restrict attention to the basic form for quadratic n x n matrices. When scaling is possible, then the scaled matrix is uniquely determined even though the factors are not. Most authors call a matrix still scalable if it is approximately scalable in the following sense: There is a sequence of scaling transformations such that the sequence of transformed matrices converges to a doubly stochastic matrix. Example. The matrix
where the sum is taken over all permutations TT of {1,..., n}. In strong contrast to determinants, permanents are #P-complete [18] and therefore considered infeasible. The best known deterministic algorithm due to Ryser [16] runs in time O(2nn) for n x n matrices. It is based on the inclusion-exclusion principle. As an exact computation is difficult, much attention has been given to approximation algorithms for permanents, culminating in the approximation scheme of Jerrum, Sinclair and Vigoda [9]. A different approach has been initiated by Linial, Samorodnitsky and Wigderson [13]. These authors use Tlesearch supported in part by NSF Grant CCR-0209099 t Department of Computer Science and Engineering, Pennsylvania State University
216
is scaled by the row factors (A;, l/k) and the column factors l f c , f c into the matrix
For k = 1,2,..., the sequence of matrices converges to the doubly stochastic unit matrix, while obviously no single scaling operation can transform the given matrix into a doubly stochastic one. Every 0-1 matrix A can be viewed as the bipartite adjacency matrix of a bipartite graph G. Every a^ with value 1 represents an edge between vertex i in the left part and vertex j in the right part. The permanent
per (A) is the number of perfect matchings in G. With The prominence of Bipartite Matching among the every nonnegative matrix A, we associate a bipartite open problems in parallel computing justifies the study adjacency matrix by replacing every positive entry a,ij of such a convergence result, even though it is at best a step towards solving the parallel bipartite matching by 1. It can be shown that a nonnegative matrix is ap- problem. proximately scalable if and only if the associated bipartite graph contains a perfect matching. Furthermore, 2 Sinkhorn's Algorithm and the LSW Algorithm such a matrix is exactly scalable if every edge {i,j} (corresponding to a positive entry a^) participates in Sinkhorn's algorithm [17] uses the most natural apsome perfect matching. proach. It alternates between dividing all rows by their It has been shown [13] that a nonnegative matrix respective row sums and all columns by their respective is approximately scalable if and only if it can be scaled column sums. Proving convergence of this algorithm into a matrix in which all row sums are 1 and all column has not been an easy task. Sinkhorn [17] has shown sums deviate from 1 by less than l/n. Therefore any convergence for the special case with all matrix entries provably fast converging iterative scaling algorithm can strictly positive. Much later, Bapat and Raghavan [1] be used to decide whether a bipartite graph has a perfect have solved the general case. Franklin and Lorenz [6] matching. have used Hilbert's projective metric to show that for Nathan Linial and Alex Samorodnitsky and Avi strictly positive matrix entries, convergence is indeed Wigderson [13] write on this subject: "The goal of this quite good, namely linear (also known as geometric conshort subsection is to emphasize the following curious vergence) . and potentially promising aspect of this work. Our It is easy to see that Sinkhorn's algorithm requires new scaling algorithm may be used to decide whether at least a linear number of iterations to achieve a a given bipartite graph has a perfect matching. The deviation of O(l/n) from 1 in every row sum. approach we use is conceptually different from known methods." These authors notice that each iteration of Example. Let A be the following n x n matrix with | their algorithm can be carried out in NC. Hence, the in two diagonals and zeros everywhere else. parallel decision problem Bipartite Matching would be solved if one could find a similar algorithm that requires only a poly logarithmic rather than a polynomial number of iterations. An approximately doubly stochastic matrix represents an approximate fractional matching in the associated bipartite graph. It is known [8] how to convert such a fractional matching into a proper perfect matching in Sinkhorn's algorithm first only affects the ends of polylogarithmic time (for bipartite graphs). the diagonals and works its way towards the center over This paper fits in the line of research proposed by the next n/2 iterations. With appropriate choices of Linial and Samorodnitsky and Wigderson [13]. It might bring us closer to a solution of this most challenging factors for the rows and columns, it is not hard to show problem of parallel computing. First, we show that that A can be transformed into any of the following the Linial-Samorodnitsky-Wigderson algorithm (LSW matrices Bkalgorithm) indeed requires a polynomial rather than a polylogarithmic number of iterations, even though it is much faster than any previously known algorithm. 
The reason for the large number of iterations is that in some sense the algorithm operates locally on a matrix. We present a new global algorithm, indicating for the first time that scaling can often be done fast. We are not (yet) able to prove or disprove the conjecture that Thus the matrix A is approximately scalable (into a our algorithm always runs in NC. What we can show is quadratic convergence, which can be interpreted as a matrix A, with entries 1 in the top-right to bottom-left hint that this algorithm might run much faster than any diagonal), as the sequence B\,B-2,... converges to .4. other known scaling algorithm. All previously known The LSW algorithm is much more sophisticated scaling algorithms converge at most linearly. in order to achieve strongly polynomial time, i.e.,the
217
bound on the number of arithmetic operations depends (polynomially)on n and 1/loge, but not on the matrix entries. Linial, Samorodnitsky and Wigderson [13] actually provide two algorithms, the second being a modification of Sinkhorn's algorithm. Here, we disregard this second algorithm, because it clearly needs at least a linear number of iterations for 0-1 matrices. For arbitrary nonnegative matrices it even uses a matching algorithm in its preprocessing stage, making it useless for our special purposes. The (first) LSW algorithm plans a block-wise treatment of columns together with a Sinkhorn-like treatment of every row. There are just two blocks, based on the largest gap between column sums. The treatment of the two blocks of columns depends on a parameter 6. In the block with small column sums, all columns are multiplied by 1 + 5, while the other columns are not modified. Only after the rows are individually multiplied to obtain row sums 1, is the parameter 8 chosen to optimize the combined effect. The argument used to show that Sinkhorn's algorithm is slow in the previous example, applies to the LSW algorithm as well. Initially, all but the first and last column have the same sum (after row normalization). Thus all these columns belong to the same block. Every iteration cannot chip away more than 2 columns from this block. Therefore, at least a linear number of iterations is required before the "center" of the matrix is affected at all. Indeed, it seems that with some effort, an Q(n3) lower bound on the number of iterations is provable.
Example. If the matrix A of the previous Example is modified to A" writing a small nonzero element k~n in the top left corner, then A" is strictly scalable. Indeed the scaled matrix A" is obtained from Bk of that Example by writing 1/fc in the top left position and multiplying the whole matrix with 1/(1 + I/A:). Nevertheless, Sinkhorn and LSW are still as slow as before, while the new algorithm converges first linearly for O(logA;) steps. Then the approximation is already close to the scaled matrix and converges quadratically. As a drawback, note that the running time of the new algorithm depends not only on the dimension of the matrix. It is slowed down by tiny positive entries. 3 The New Algorithm Without loss of generality, we assume the given n x n matrix A is symmetric and the same factor Xi is used for row i and column i. If the given matrix A is not symmetric, we would just consider the 2n x 2n matrix
218
A' defined as follows (where AT is A transposed).
We actually consider a whole class of algorithms parameterized by a continuous nondecreasing function g with the properties g(—s) + g(s) = 0 and (0) = 1. In this proceedings version, we only consider g(s) ~ s. Possibly better choices are g(s) = ln(l 4- s) (for s > 1) or g(s) = \ ln(l -t- 2s) (for s > 0), because of a smaller tendency to over-reaction. Main Conjecture Algorithm A runs in time polylogarithmic in n and polynomial in l/e for all 0-1 matrices with positive permanent (if a fast parallel Gaussian elimination procedure is used). Based on this conjecture, Algorithm A should be amended to stop with the conjecture "per A = 0" whenever the running time is too high or when progress is stalling (based on an extended conjecture). Alternatively, one might dare to conjecture that Algorithm A (as it is) already detects the case "per .A = 0" fast. The idea of Algorithm A is quite simple and natural. Assume, we are given a vector s 6 En (an n-vector). It determines the vector x € Mn by Xi = e9^Si) for i ==• 1,..., n. The vector x is used to scale the matrix A. Multiplying every row i of A by Xi and every column j of A by Xj, we obtain a scaled matrix B. Now let r € R n be the n-vector of row sums of B (which are equal to the column sums). Finally, we subtract 1 from r where 1 is the n-vector with all components being 1. Note that 1 is the desired row sum vector. So far, we have described a function / : 1R —* R mapping s to r — 1. Obviously, we are interested in an n-vector s which is a zero of /. Such a zero would scale A immediately into a doubly stochastic matrix B = A. As we have no means to find such an s directly, we employ a Newton method to approximate such a zero. We have a current value of s = 0, resulting in the current matrix A (by scaling it with x = 1). The next step of Newton iteration finds a better s as follows. The function / is replaced by its linear approximation / at 0 (defined with the help of its Jacobian matrix L). Computing a zero of this linear approximation means solving a system of linear equations, i.e., doing Gaussian elimination. If there are multiple solutions, then Gaussian elimination is able to pick the one with minimal Z/2-norm. If there is no zero, then it turns out that also the scaling problem has no solution, and the algorithm can stop right away. Before we describe our algorithm in more detail, we introduce some notation. For every n-vector v, let diag(v) be the n x n matrix with v in its diagonal. We
also use the notation S = diag(s) and X = diag(x). and its row sum vector r. The matrix B is obtained Furthermore, we denote by ev the n-vector with ith from A by one scaling step. component Then Note that taking row sums is expressed by a matrix multiplication with the n x 1 matrix 1. We use a tilde to denote linear approximations. Thus / is the linear approximation of the function /. Our current matrix A is transformed by the linear approximation (to the first scaling operation) into a matrix A. The actual first scaling operation transforms A into B. The goal of scaling is to obtain a doubly stochastic matrix, i.e., a matrix with nonnegative entries with .row sums and column sums equal to 1. If the matrix A can be transformed into a doubly stochastic matrix, then we denote this unique matrix by A. The matrices obtained by any convergent scaling algorithm approach A in the limit. Let us determine the linear approximation to / at 0.
and r is defined by
Algorithm A: 1. W.l.o.g., assume all matrix entries are between 0 and 1 and all row sums are at least 1. A is symmetric. 2. L = A + diag(A 1) is the Jacobian matrix associated with the function /. The matrix L is symmetric. 3. Compute s (solving Ls + A1 - 1 = 0, i.e., f(s) = 0), and let 5 = diag(s), i.e., 5 is the diagonal matrix with the vector s in its diagonal. Select the solution s of minimal norm if there are multiple solutions. If there is no solution, write "per (A) = 0" and stop. 4. Find the connected components in the bipartite graph of the linear approximation A = A + AS + SA to the scaled matrix B of A
Thus the Jacobian matrix of the function / is
We now describe the scaling algorithm A (see Figure 1) associated with g in somewhat more detail. The conditions on g ensure that the linear approximation of g(si) at 0 is just Si, and therefore at 0, the linear approximation of gS^HsKf?) is the same as the linear approximation of eSi+s', namely 1 + S* + Sj. Hence, the linear approximation A to the scaled matrix B is
5. If in the bipartite graph of A+AS+SA, the large side of an unbalanced connected component has no edge to any other component of the graph of A, then write "per (A) = 0" and stop. 6. If all entries of A+AS+SA are nonnegative then write "per (A) > 0." 7. While the current correction would not improve the worst row sum do for all i do
8. 9. If
With Gauss elimination, we determine an s such that this matrix A has all row sums 1. Once we have determined such a vector s € M n , we define the n-vectors x (by Xi = e9^), the matrix B,
then stop else A := B goto 1.
Figure 1: Algorithm A
219
this happens, we can already conclude that the permanent of A is not zero, i.e., a perfect matching exists in the corresponding bipartite graph. In other words, not only does our algorithm converge fast after many iterAs our goal is to obtain a matrix A with all row ations, it is also very difficult to find examples where sums (and therefore all column sums) equal to 1, we the algorithm actually requires more than a few iterations (when the goal is just to decide whether a perfect define the deviation function / by matching exists). In any case, one iteration of our algorithm consists of finding s (and thus x and X] as indicated above, and computing B = XAX from A. The next iteration or equivalently continues with B instead of A. The main problem to show effective bounds on the onset of quadratic convergence is to prove that once all the row sums are close to 1, the matrix entries cannot change much anymore. Finally, we have found a very In Step 3, the linear equation /(s) = 0, i.e., short and elegant proof of this difficult result. L s 4- A I - \ = 0 is solved in polylogarithmic time by parallel Gaussian elimination [3]. LEMMA 3.1. If all row (and column) sums of the matrix The system of linear equations is unsolvable in the A differ from 1 by at most 8, then every matrix entry trivial case, where the bipartite graph associated with changes by at most n6, when going from A to the linear A has connected components with unequal numbers of approximation L, i.e., \a,ij — &ij\ < n6. vertices in the two parts. Many non-trivial cases with permanent 0 are dis- Proof. Given any pair i',j', select t € K. such that either covered in Step 5, because some entries of A might be 0, even though the corresponding entries in A are not. In fact, it is fairly difficult to hand-pick a matrix A with permanent 0 that is not caught in Step 5 during the first round (or the first few rounds). If all entries of A = A + AS + SA are nonnegative, Define then the bipartite graph associated with A, and thus also the bipartite graph associated with A, have a fractional matching and therefore also a perfect matching. Decreasing the amount of a Newton correction in Step 7 is certainly required for some matrices with tiny positive entries that have to be scaled to big entries. In this case, Algorithm A is slow. We are not really interested in such matrices, as we conjecture that they We assume every row sum of A differs from 1 by at most don't occur when we start from a G-l matrix. S. The correction towards the linear approximation, is We have omitted the treatment of singular Jacobian given by matrices from this proceedings version. If the rank deficiency is only 1, than this is relatively easy to handle. The linear approximation associated with the n- using the assumption on the row sums of A and the fact vector s produces a symmetric matrix L with that the row sums of L are exactly 1, we obtain the inequalities or equivalently
with all row sums equal to 1. It can be shown that this equation is always solvable when the matrix A is approximately scalable (and in some other cases). If A is a random nonnegative matrix with nonzero and permanent from a wide variety of probability distributions, then it is conjectured that the matrix L exists and all its entries are nonnegative with high probability. If
220
Summing Inequality (3.1) over all j € I+ and Inequality (3.2) over all i € 7~, we obtain
or equivalently
As each additive term on the left hand side of Inequality (3.4) is nonnegative, each of them fulfills the inequality
THEOREM 3.1. Locally, with Algorithm, A, the row sums of any strictly scalable matrix A converge quadratically. Proof. Let the nonnegative symmetric matrix A be exactly scalable. Let A < 1 be the minimum positive entry in the scaled matrix A. Assume, A is already very close to being doubly stochastic, i.e.,
for a sufficiently small 5 > 0. We assume
implying 6 < ^ and 8 < 1/2. By Lemma 3.2, we know that (a^ — a ^ j < n8. Because 8 < ^, we know that all positive a^ have a value of at least ^. By Lemma 3.1, the difference between a matrix We have shown that the linear approximation A = A 4- AS + SA associated with A does not differ much entry a^ in the given matrix and the corresponding from A as soon as all row sums of A are sufficiently close entry iij — 0^(1 + Si + Sj) in the linear approximation to 1. Similarly, we show now that in this situation, even fulfills the scaled matrix A does not differ much from A. Noting that ai>j'\Si + s'j\ for the given i', j' is also such a term, finishes the proof.
LEMMA 3.2. If all row (and column) sums of matrix A differ from 1 by at most 8. then every matrix entry This implies changes by at most n8, when going from A to its scaled matrix A, i.e., |djj — a^| < n8. Proof. The argument is similar to that of Lemma 3.1 utilizing the fact that like the linear approximation A also the fully scaled matrix A has row sums exactly 1. because 8 < -^ In 2. We only consider the case g(s) = s. Thus the matrix We omit the proof from this proceedings version. For only approximately scalable matrices,the argu- B obtained from A in one step has entries ment is slightly more complicated, as we have to work with an appropriate approximately scaled matrix as well as with the limit matrix. We use the fact that the linear approximation has row REMARK 3.1. The bound on the absolute change of aij sums exactly 1 (except for some rounding error that can in Lemmas 3.1 and 3.2 cannot be replaced by any bound easily be tolerated), thus on its relative change. It can be shown that for 8 > ^-y, the factor 1 + Si + Sj can be arbitrarily high. From Lemmas 3.1 and 3.2, one can derive quadratic convergence for sufficiently small 8 for every scalable matrix. This result does not hold for only approxiNow we show that the matrix B obtained from A mately scalable matrices as convergence to 0 of some in one step of Algorithm A has significantly improved matrix elements is much slower. row sums.
221
and Wigderson [13]. As we are unable to prove this conjecture, we still support it by the fact that once the new algorithm is sufficiently close to a doubly stochastic matrix, it converges quadratically, i.e., much faster than any previous scaling algorithm. Even though this is a Newton-like approach, quadratic convergence of the matrices is not obvious. It is based on the fact that a bound 8 on the deviation of the row sums from 1, implies a (tight) bound of nd on the change in every matrix element during the scaling process. We conjecture this algorithm to run in time polylogarithmic in n and polynomial in l/e for all with positive permanent. This would provide a polylogarithmic bipartite matching algorithm. So far, this conjecture is supported mainly by a lack of counterexamples, while the result can be shown for many classes of matrices for which all other known algorithms are slow. References
REMARK 3.2. Theorem 3.1 could also be obtained from general principles after restricting f to a subspace with regular Jacobian. The direct approach gives some insight in the start of quadratic convergence. From Lemma 3.2, we immediately obtain the following more . interesting result. THEOREM 3.2. Locally, with Algorithm A, every element of a strictly scalable matrix A converges quadratically. D 4 Conclusion The main purpose of this paper was to present a new algorithm for scaling of matrices. It has been shown that this algorithm has some nice properties not shared by any previously known scaling algorithm. While previous scaling algorithms operate much more locally, our fairly natural algorithm that takes a global view. This makes it conceivable, for the first time, to perform scaling (with sufficient precision) in polylogarithmic time in parallel, while the previously known algorithms can easily be forced to run for at least a linear number of steps. Every step of the new algorithm runs in NC. If (as conjectured), the number of steps is polylogarithmic, then this would decide the perfect bipartite matching problem in NC as suggested by Linial, Samorodnitsky
222
[1] R. B. BAPAT AND T. E. S. RAGHAVAN, An extension of a theorem of Darroch and Ratcliff in loglinear models and its application to scaling multidimensional matrices, Linear Algebra Appl., 114/115 (1989), pp. 705715. [2] A. BOROBIA AND R. CANTO, Matrix scaling: A geometric proof of Sinkhorn 's theorem, Linear Algebra and its Applications, 268 (1998), pp. 1-8. [3] L. CSANKY, Fast parallel matrix inversion algorithms, SIAM J. Comput., 5 (1977), pp. 618-623. [4] G. EGORICHEV, The solution of the van der Waerden problem for permanents, Dokl. Akad. Nauk SSSR, 258 (1981), pp. 1041-1044. [5] D. FALIKMAN, A proof of van der Waerden's conjecture on the permanents of a doubly stochastic matrix, Mat. Zametki, 29 (1981), pp. 931-938. [6] J. FRANKLIN AND J. LORENZ, On the scaling of multidimensional matrices, Linear Algebra and its Applications, 114/115 (1989), pp. 717-735. [7] D. R. FULKERSON AND P. WOLFE, An algorithm for scaling matrices, SIAM Review, 1962 (1962), pp. 142146.
[8] A. V. GOLDBERG, S. A. PLOTKIN, D. B. SHMOYS, AND E. TARDOS, Using interior-point methods for fast parallel algorithms for bipartite matching and related problems, SIAM Journal on Computing, 21 (1992), pp. 140-150. [9] JERRUM, SINCLAIR, AND VIGODA, A polynomial-time approximation algorithm for the permanent of a matrix with non-negative entries, in STOC: ACM Symposium on Theory of Computing (STOC), 2001. [10] B. KALANTARI AND L. KHACHIYAN, On the complexity of nonnegative-matrix scaling, Linear Algebra and its Applications, 240 (1996), pp. 87-103.
[11] L. KHACHIYAN, Diagonal matrix scaling is NP-hard, Linear Algebra and its Applications, 234 (1996), pp. 173-179. [12] R. R. KLIMPEL, Matrix scaling by integer programming, Communications of the ACM, 12 (1969), pp. 212-213. [13] N. LINIAL, A. SAMORODNITSKY, AND A. WIGDERSON, A deterministic strongly polynomial algorithm for matrix scaling and approximate permanents, in ACM Symposium on Theory of Computing (STOC), 1998, pp. 644-652. [14] A. NEMIROVSKI AND U. ROTHBLUM, On complexity of matrix scaling, Linear Algebra and its Applications, 302-303 (1999), pp. 435-460. [15] U. ROTHBLUM AND H. SCHNEIDER, Scalings of matrices which have prespecified row sums and column sums via optimization, Linear Algebra Appl., 114/115 (1989), pp. 737-764. [16] H. J. RYSER, Combinatorial mathematics, Published by The Mathematical Association of America, 1963. The Carus Mathematical Monographs, No. 14. [17] R. SINK HORN, A relationship between arbitrary positive matrices and doubly stochastic matrices, Annals of Mathematical Statistics, 35 (1964), pp. 876-879. [18] L. G. VALIANT, The complexity of computing the permanent, Theoretical Computer Science, 8 (1979), pp. 189-201.
223
Partial Quicksort Conrado Martinez^ Abstract This short note considers the following common problem: rearrange a given array with n elements, so that the first m places contain the m smallest elements in ascending order. We propose here a simple variation of quicksort that efficiently solves the problem, and show and quantify how it outperforms other common alternatives. 1 Introduction In many applications, we need to obtain a sorted list of the m smallest elements of a given set of n elements. This problem is known as partial sorting. Sorting the whole array is an obvious solution, but it clearly does more work than necessary. A usual solution to this problem is to make a heap with the n given elements (in linear time) and then perform m successive extractions of the minimum element. Historically, this has been the way in which C++ STL's partial-sort function has been implemented [5]. Its most prominent feature is that it guarantees G(n 4- mlogn) worst-case performance. Another solution, begins by building a max-heap with the first m elements of the given array, then scanning the remaining n — m elements and updating the heap as necessary, so that at any given moment the heap contains the m smallest elements seen so far. Finally, the heap is sorted. Its worst-case cost is ©((m + n) logm) and it is not an interesting alternative unless m is quite small or we have to process the input on-line. Last but not least, we can solve the problem by first using a selection algorithm to find the mth smallest element. Most selection algorithms also rearrange the array so that the elements which are smaller than the sought element are to its left, the sought element is at the mth component, and the elements which are "The research of the author was supported by the Future and Emergent Technologies programme of the EU under contract IST1999-14186 (ALCOM-FT) and the Spanish Min. of Science and Technology project TIC2002-00190 (AEDRI II). ^Departament de Llenguatges i Sistemes Informatics, Universitat Politecnica de Catalunya, E-08034 Barcelona, Spain. Email: c [email protected].
224
larger are to the right. Then, after the mth smallest element has been found, we finish the algorithm sorting the subarray to the left of the mth element. Using efficient algorithms for both tasks (selection and sort) the total cost is G(n + mlogm). The obvious choice is to use Hoare's quickselect and quicksort algorithms [3, I]. Then, the cost stated above is only guaranteed on average, but in practice this combination should outperform most other choices. The Copenhagen STL group implements partial-sort that way—actually, using finely tuned, highly optimized variants of these algorithms (http://www.cphstl.dk). For convenience, we call this combination of the two algorithms quickselsort. In this paper we propose partial quicksort, a simple and elegant variant of quicksort that solves the partial sorting problem, by combining selection and sorting into a single algorithm. To the best of the author's knowledge the algorithm has not been formally proposed before. However, because of its simplicity, it may have been around for many years. Partial quicksort has the same asymptotic average complexity as quickselsort, namely, Q(n + mlogm), but does less work. In particular, if we consider the standard variants of quickselsort and partial quicksort, the latter saves 2m — 4him + O(l) comparisons and m/3 — 5/6 In m 4- 0(1) exchanges. The rest of this short note is devoted to present the algorithm and to analyze the average number of comparisons and exchanges of the basic variant. Finally, we compare them with the corresponding values for the quickselsort algorithm. 2 The algorithm Partial quicksort works as follows. In a given recursive call we receive the value m and a subarray ^[i.j] such that i < m. We must rearrange the subarray so that A[i..m] contains the m — i + I smallest elements of Afi.j"] in ascending order; if m > j that means that we must fully sort j4[i..j]. The initial recursive call is with j4[l..ra]. If the array contains one or no elements, then we are done. Otherwise, one of its elements is chosen as the pivot, say p, and the array >l[i..j] is partitioned so that A[i..k — 1] contains the elements that are smaller
than p, A[k\ contains the pivot and A[k + l..j] contains the elements which are larger than the pivot1. We will assume that exactly n — 1 element comparisons are necessary to carry out the partitioning of the array and that partitioning preserves randomness [16]. If k > m then all the sought elements are in the left subarray, so we only need to make a recursive call on that subarray. Otherwise, if A: < ra then we have to make a recursive call on both subarrays. Notice that when k < m the function will behave exactly as quicksort in the left subarray and fully sort those elements, whereas the recursive call on the right subarray has still the job to put the rath element into its final place. Algorithm 1 depicts the basic or standard variant of the algorithm. Usual optimizations like recursion cutoff, sampling for pivot selection, recursion removal, optimized partition loops [1, 14, 15, 16], etc. can be applied here and presumably yield benefits similar to those for the quicksort and quickselect algorithms. However, we do not analyze these refined variants in this paper and we stick to the simpler case.
Then
and Po,m — 0> otherwise. As we have already mentioned, partial quicksort behaves exactly as quicksort when m = n, so that Pn,n = qn = 2(n + l)Hn - 4n, where lnn-l-0(l) denotes the nth harmonic number [9, 10, 14]. Let
Hence, we get the recurrence
void partial_quicksort(vector<Elem>& A, int i, int j , int m) { if (i < j) { int pidx = select_pivot(A, i, j); int k; partition(A, pidx, i, j, k); // A[i..k-l] < A[k] < A[k+l..j] partial_quicksort(A, i, k - 1, m) ; if (k < m - 1) // 'A' staxts at index 0 partial_quicksort(A, k + 1, j, m);
which, except for the toll function tn,m, has the same form as the recurrence for the average number of comparisons made by quickselect (see [7, 10]). The recurrence above holds for whatever pivot selection scheme we use: for instance, median-of-three [6, 18], median-of(2t + l) [13], proportion-from-s [12], pseudomedian-of-9 (also called ninther) [1], ____ However, as we have already pointed out, we will only consider the basic variant, hence, for the rest of this paper we take Trn,k = 1/n, for 1 < k < n. The techniques that we use to solve recurrence (3.2) Algorithm 1: Partial quicksort (i.e., to find a closed form for Pn,m) are fairly standard and rely heavily on the use of generating function as the main tool. Sedgewick and Flajolet's book [17] and Knuth's The Art of Computer Programming [8, 9] are 3 The average number of comparisons excellent starting points which describe in great detail Let Pn,m denote the average number of (key) compar- these techniques. isons made by partial quicksort to sort the m smallest First, we introduce the bivariate generating funcelements out of n. Let 7rnifc denote the probability that tions (BGFs) associated to the quantities Pn,m and tn^m: the chosen pivot is the fcth element among the n given elements. We assume, as it is usual in the analysis of comparison-based sorting algorithms, that any permutation of the given distinct n elements is equally likely.
'We assume for simplicity that all the elements in the array are distinct. Only minor modifications are necessary to cope with duplicate elements.
We can then translate (3.2) into a functional relation over the BGFs, namely,
225
whose solution is
and Fn>m = 0 otherwise. On the other hand,
subject to F(0, u) = 0, as F0,m = 0 for any m (see for instance [11])we can deSince compose F(z,w) as P(z,u) = F(z,u) + S(z, u), where F(z,u) accounts for the selection part of the toll function (n — 1) and 5(2, u) for the sorting part of the toll That is, with T(z,u) = function Tp(z,u} + Ts(z,u), we have and thus
and because of linearity,
since 5(0, u) — 0 and hence KS = 4. Extracting the coefficients 5n,m from S(z,u) is straightforward from
Also, from the combinatorics of the problem, we have F(0,u) = 0 and 5(0, u) = 0, since [z°um]F(z,u) = F0,m = 0 and [z°um]5(z, u) = 50,m = 0. This means that Kp = -4 and therefore F(z, u) is exactly the same BGF as the one for the average number of comparisons made by standard quickselect to find the mth element out of n elements. Namely,
We have then
Extracting the coefficients F get the well-known result [7]
226
whenever 1 < m < n, and Sn>m = 0 otherwise. Finally, adding (3.4) and (3.5) we get
The quantity of interest is then if 1 < m < n, and Pm,n — 0 otherwise. As a further check, the reader can easily verify that Pn^n = qn. Now we can compare the average number of comparisons of partial quicksort Pn,m with that of quickselsort, that is, Fn,m + Qm-i- And it turns out that partial and if we compare q'm_i with the result above, we get quicksort makes that partial quicksort makes operations less than quickselsort. For instance, for the comparisons less than its alternative; this is probably partitioning method given in [1o'tally, in chunks, while it looks for the mth element, whereas quickselsort makes the initial recursive call to quicksort on the chunk of m — 1 elements smaller than the mth. That means that the pivots used to find the Hence, mth element and that are to its left (0(logm) on average) will be again compared, exchanged, etc. by the quicksort call while this is not the case with partial quicksort. Also, it seems that by "breaking" the sorting of the m — 1 smallest elements in the way partial quicksort does, it makes bad partitions at early stages more unlikely and thus reduces somewhat the average complexity. Acknowledgements I thank A. Viola and R.M. Jimenez for their useful comments and remarks.
227
References [1] J.L. Bentley and M.D. Mcllroy. Engineering a sort function. Software—Practice and Experience, 23:12491265, 1993. [2] J.L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proc. of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 360-369, 1997. [3] C.A.R. Hoare. FIND (Algorithm 65). Comm. ACM, 4:321-322, 1961. [4] C.A.R. Hoare. Quicksort. Computer Journal, 5:10-15, 1962. [5] N.M. Josuttis. The C++ Standard Library: A Tutorial and Reference Guide. Addison-Wesley, 1999. [6] P. Kirschenhofer, H. Prodinger, and C. Martinez. Analysis of Hoare's FIND algorithm with median-ofthree partition. Random Structures & Algorithms, 10(1):143-156, 1997. [7] D.E. Knuth. Mathematical analysis of algorithms. In Information Processing '71, Proc. of the 1971 IFIP Congress, pages 19-27, Amsterdam, 1972. NorthHolland. [8] D.E. Knuth. The Art of Computer Programming: Fundamental Algorithms, volume 1. Addison-Wesley, Reading, Mass., 3rd edition, 1997. [9] D.E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Addison-Wesley, Reading, Mass., 2nd edition, 1998. [10] H.M. Mahmoud. Sorting: A Distribution Theory. John Wiley & Sons, New York, 2000. [11] C. Martinez, D. Panario, and A. Viola. Analysis of quickfind with small subfiles. In B. Chauvin, Ph. Flajolet, D. Gardy, and A. Mokkadem, editors, Proc. of the $Td Col. on Mathematics and Computer Science: Algorithms, Trees, Combinatorics and Probabilities, Trends in Mathematics, pages 329-340. Birkhauser Verlag, 2002. [12] C. Martinez, D. Panario, and A. Viola. Adaptive sampling for quickselect. In Proc. of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2004. Accepted for publication. [13] C. Martinez and S. Roura. Optimal sampling strategies in quicksort and quickselect. SI AM J. Comput., 31(3):683-705, 2001. [14] R. Sedgewick. The analysis of quicksort programs. Acta Informatica, 7:327-355, 1976. [15] R. Sedgewick. Implementing quicksort programs. Comm. ACM, 21:847-856, 1978. [16] R. Sedgewick. Quicksort. Garland, New York, 1978. [17] R. Sedgewick and Ph. Flajolet. An Introduction to the Analysis of Algorithms. Addison-Wesley, Reading, Mass., 1996. [18] R.C. Singleton. Algorithm 347: An efficient algorithm for sorting with minimal storage. Comm. ACM, 12:185-187, 1969.
228
AUTHOR INDEX
Abu-Khzam,F. N.,62 Laber, E. S.,79 Lam,T.-W.,31 Langston, M. A.,62 Leaver-Fay, A., 39 Lhote, L, 199 Liu, Y., 39
Baladi, V., 170 Barequet, G., 161 Bender, M. A., 18 Ben-Moshe, B., 120 Blandford, D. K.,49 Blelloch, G. E.,49 Bodlaender, H. L, 70 Bradley, B., 18 Brodal, G. S.,4
Martinez, C., 224 Mehnert, J.,142 Moffie, M., 161
Cardinal, J., 112 Carmi, P., 120 Collins, R. L,62
Panario, D., 185 Pessoa, A. A., 79 Pillaipakkamnatt, K., 18 Prodinger, H.,211 Pyrga,E.,88
de Souza, C., 79 Dementiev, R., 142
Richmond, B., 185 Russet, D., 129
Eppstein, D., 112 Fagerberg, R., 4 Fellows, M. R.,62 Flajolet, P., 152 Furer, M., 216
Sanders, P., 142 Schulz, F.,88 Snoeyink, J., 39 Sung, W.-K., 31 Suters.W. H.,62 Symons, C. T.,62 SzpankowskLW., 153
Guibas, L, 129 Gutman, R., 100 Halperin, D.,3 Hitczenko, P., 194 Hon,W.-K.,31
Tse,W.-L,31 Vallee, B.. 170 Vintner, K.,4
Jagannathan, G., 18
Wagner, D.,88 Ward, M. D., 153 Wong,C.-K.,31
Karavelas, M. I., 129 KashJ. A.,49 Katz. M. J., 120 Kettner, L., 142 Knopf mac her. A., 194 Koster, A. M. C. A., 70
Yip, M., 185 Yiu,S.-M.,31
Zaroliagis, C., 88
229